"Remanso de rio largo, viola da solidão:
Quando vou p'ra dar batalha, convido meu coração."
Gentle backwater of wide river, fiddle to solitude: 
When going to do battle, I invite my heart. 

João Guimarães Rosa (1908-1967).

Grande Sertão: Veredas.



"Sertão é onde o homem tem de ter a dura nuca e a mão quadrada.

(Onde quem manda é forte, com astúcia e com cilada.)

Mas onde é bobice a qualquer resposta,
é aí que a pergunta se pergunta."
"A gente vive repetido, o repetido...
Digo: o real não está na saída nem na chegada:
ele se dispõe para a gente é no meio da travessia."

Sertao is where a man's might must prevail, 
where he has to be strong, smart and wise. 

But where every answer is wrong, 
there is where the question asks itself. 
We live repeating the repeated...
I say: the real is neither at the departure nor at the arrival: 
It presents itself to us at the middle of the journey. 
Preface



"Life is like riding a bicycle. 
To keep your balance you must keep moving."

Albert Einstein. 



The main goals of this book are to develop an epistemological framework based on 
Cognitive Constructivism, and to provide a general introduction to the Full Bayesian 
Significance Test (FBST). The FBST was first presented in Pereira and Stern (1999) as 
a coherent Bayesian method for assessing the statistical significance of sharp or precise
statistical hypotheses. A review of the FBST is given in the appendices, including: 

a) Some examples of its practical application; 

b) The basic computational techniques used in its implementation; 

c) Its statistical properties; 

d) Its logical or formal algebraic properties; 

The items above have already been explored in previous presentations and courses.
In this book we shall focus on presenting 

e) A coherent epistemological framework for precise statistical hypotheses. 

The FBST grew out of the necessity of testing sharp statistical hypotheses in several
instances of the consulting practice of its authors. By the end of the year 2003, various 
interesting applications of this new formalism had been published by members of the 
Bayesian research group at IME-USP, some of which outperformed previously published 
solutions based on alternative methodologies, see for example Stern and Zacks (2002). In 
some applications, the FBST offered simple, elegant and complete solutions whereas alter- 
native methodologies offered only partial solutions and / or required convoluted problem 
manipulations, see for example Lauretto et al. (2003). 

The FBST measures the significance of a sharp hypothesis in a way that differs com- 
pletely from that of Bayes Factors, the method of choice of orthodox Bayesian statistics. 
These methodological differences fired interesting debates that motivated us to investi- 
gate more thoroughly the logical and algebraic properties of the new formalism. These 
investigations also gave us the opportunity to interact with people in communities that 
were interested in more general belief calculi, mostly from the areas of Logic and Artificial 
Intelligence, see for example Stern (2003, 2004) and Borges and Stern (2007).

However, as both Orthodox Bayesian Statistics and Frequentist Statistics have their
own well established epistemological frameworks, namely, Decision Theory and Popperian
Falsificationism, respectively, there was still one major gap to be filled: the establishment 
of an epistemological framework for the FBST formalism. Despite the fact that the daily 
practice of Statistics rarely leads to epistemological questions, the distinct formal proper- 
ties of the FBST repeatedly brought forward such considerations. Consequently, defining 
an epistemological framework fully compatible with the FBST became an unavoidable 
task, as part of our effort to answer the many interesting questions posed by our col- 
leagues. 

Besides compatibility with the FBST logical properties, this new epistemological
framework was also required to fully support sharp (precise or lower dimensional)
statistical hypotheses. In fact, in contrast with the decision theoretic epistemology of the
orthodox Bayesian school, which is usually hostile or at least unsympathetic to this kind
of hypothesis, this new epistemological framework actually puts, as we will see in the
following chapters, sharp hypotheses at the center stage of the philosophy of science.



Cognitive Constructivism 

The epistemological framework chosen for the aforementioned task was Cognitive
Constructivism, as presented in chapters 1 to 4, which constitute the core lectures of this
course. The central epistemological concept supporting the notion of a sharp statistical
hypothesis is that of a systemic eigen-solution. According to Heinz von Foerster,
the four essential attributes of such eigen-solutions are: discreteness (sharpness),
stability, separability (decoupling) and composability. Systemic eigen-solutions correspond to
the "objects" of knowledge, which may, in turn, be represented by sharp hypotheses in
appropriate statistical models. These are the main topics discussed in chapter 1.

Within the FBST setup, the e-value of a hypothesis, H, defines the measure of its 
Epistemic Value or the Value of the Evidence in support of H, provided by the observa- 
tions. This measure corresponds, in turn, to the "reality" of the object described by the 
statistical hypothesis. The FBST formalism is reviewed in Appendix A. 

In chapter 2 we delve into this epistemological framework from a broader perspective, 
linking it to the philosophical schools of Objective Idealism and Pragmatism. The general 
approach of this chapter can be summarized by the "wire walking" metaphor, according 
to which one strives to keep in balance at a center of equilibrium, to avoid the dangers of 
extreme positions that are far away from it, see Figure J.1. In this context, such extreme
positions are related to the epistemological positions of Dogmatic Realism and Solipsistic 
Subjectivism. 






Chapters 3 and 4 relate to another allegory, namely, the Bicycle Metaphor: On a bike,
it is very hard to achieve a static equilibrium, that is, to keep one's balance by standing
still. Fortunately, it is easy to achieve a dynamic equilibrium, that is, to ride the bike
running forward. In order to keep the bike running, one has to push the left and right
pedals alternately, which will inevitably result in a gentle oscillation. Hence a double
(first and second order) paradox: In order to stay in equilibrium one has to move forward, 
and in order to move forward one has to push left and right of the center. Overcoming 
the fear generated by this double paradox is a big part of learning to ride a bike. 

Chapters 3 and 4 illustrate realistic and idealistic metaphorical pushes in the basic 
cycle of the constructivist epistemic ride. They work like atrial and ventricular systoles 
of a heart in the life of the scientific system. From an individual point of view, these
realistic and idealistic pushes also correspond to an impersonal, extrospective or objective 
perspective versus a personal, introspective or subjective perspective in science making. 

Chapter 5 explores the stochastic evolution of complex systems and is somewhat inde- 
pendent of chapters 1 to 4. In this chapter, the evolution of scientific theories is analyzed 
within the basic epistemological framework built in chapters 1 to 4. Also, while in chap- 
ters 1 to 4 many of the examples used to illustrate the topics under discussion come from 
statistical modeling, in chapter 5, many of the examples come from stochastic optimiza- 
tion. 

Chapter 6 shows how some misperceptions in science or misleading interpretations can lead
to ill-posed problems, paradoxical situations and even misconceived philosophical dilem- 
mas. It also (re)presents some of the key concepts of Cog-Con using simple and intuitive 
examples. Hence, this last chapter may actually be the first one to read. 

Figures J.2, J.3 and J.4 illustrate the bicycle metaphor. The first is a cartoon, by
K. Przibram, of Ludwig Boltzmann, the second a photograph of Albert Einstein, and the
third a photograph of Niels Bohr. They are all riding their bikes, an activity that appears
to be highly beneficial to theoretical Probability. Boltzmann advocated for an atomistic
and probabilistic interpretation of thermodynamics, that is, viewing thermodynamics as a
limit approximation of Statistical Mechanics. His position was thoroughly rejected by the
scientific establishment of his time. One of the main criticisms of his work was the
introduction of "metaphysical", that is, non "empirical" or non "directly observable" entities.
In 1905, annus mirabilis, Einstein published his paper on Brownian motion, providing a
rigorous mathematical description of observable macroscopic fluctuation phenomena that
could only be explained in the context of Statistical Mechanics. Sadly, Boltzmann died 
the next year, before his theories were fully appreciated. Discretization and probability 
are also basic concepts in Quantum Mechanics. The famous philosophical debates be- 
tween Bohr and Einstein, involving these two concepts among others, greatly contributed 
to the understanding of the new theory. 
Basic Tools for the (Home) Works

The fact that the focus of this summer course will be on epistemological questions should not
be taken as an excuse for working less hard on statistical modeling, data analysis,
computer implementation, and the like. After all, this course will give successful
students 4 full credits in the IME-USP graduate programs!

In the core lectures we will illustrate the topics under discussion with several 'concrete'
mathematical and statistical models. We have made a conscious effort to choose illustration
models involving only mathematical concepts already familiar to our prospective
students. Actually, most of these models entail mathematical techniques that are used
in the analysis and the computational implementation of the FBST, or that are closely
related to them. Appendices A through K should help the students with their homework.
We point out, however, that the presentation quality of these appendices is very
heterogeneous. Some are (I hope) didactic and well prepared, some are only snapshots
from slide presentations, and finally, some are just commented computer codes.

Acknowledgements and Final Remarks 

The main goal of this book is to explore the FBST formalism and Bayesian statistics from 
a constructivist epistemological perspective. In order to accomplish this, ideas from many 
great masters, including philosophers like Peirce, Maturana, von Foerster, and Luhmann, 
statisticians like Peirce, Fisher, de Finetti, Savage, Good, Kempthorne, Jaynes, Jeffreys
and Basu, and physicists like Boltzmann, Planck, de Broglie, Bohr, Heisenberg, and Born
have been used. I hope it is clear from the text how much I admire and feel I owe to these
giants, even when my attitude is less than reverential. By that I mean that I always felt
free to borrow from the many ideas I like, and was also unashamed to reject the few I
do not. The progress of science has always relied on the free and open discussion of ideas,
in contrast to rigid cults of personality. I only hope to receive from the reader the same
treatment and that, among the ideas presented in this work, he or she finds some that
will be considered interesting and worthy of being kept in mind.

Chapters 1 to 4, released as Stern (2005a) and the Technical Reports Stern (2006a-c),
were used in January-February of 2007 (and again in 2008) in the IME-USP Summer
Program for the discipline MAE-5747 Comparative Statistical Inference. Chapter 5,
released as the Technical Report Stern (2007c), was also used in the second
semester of 2007 in the discipline MAP-427 - Nonlinear Programming. A short "no-math"
article based on part of the material in Chapter 1 has been published (in Portuguese) in
the journal Scientiae Studia. Revised and corrected versions of articles based on the
material presented in Chapters 1, 2 and 3 have also been either published or accepted
for publication in the journal Cybernetics & Human Knowing. In the main text and the
appendices I have used several results concerning the FBST formalism, some of its
applications, and also other statistical and optimization models and techniques, developed in
collaboration with or by other researchers. Appropriate acknowledgements and references
are given in the text.

The author has benefited from the support of FAPESP, CNPq, BIOINFO, the Institute 
of Mathematics and Statistics of the University of Sao Paulo, Brazil, and the Mathematical 
Sciences Department at SUNY-Binghamton, USA. The author is grateful to many people 
for helpful discussions, most especially, Wagner Borges, Soren Brier, Carlos Humes, Joseph
Kadane, Luis Gustavo Esteves, Marcelo Lauretto, Fabio Nakano, Osvaldo Pessoa, Rafael 
Bassi Stern, Sergio Wechsler, and Shelemyahu Zacks. The author also received interesting 
comments and suggestions from the participants of FIS-2005, the Third Conference on 
the Foundations of Information Science, and several anonymous referees. The alchemical 
transmutation of my original drafts into proper English text is a non-trivial operation, in 
which I had the help of Wagner Borges and several referees. 

But first and foremost I want to thank Professor Carlos Alberto de Braganca Pereira 
(Carlinhos). I like to say that he taught me much of the (Bayesian) Statistics I know, the
easy part, after un-teaching me much of the (frequentist) Statistics I thought I knew, the 
hard part. Carlinhos is a lover of the scientific debate, based on the critical examination 
of concepts and ideas, always poking and probing established habits and frozen ideas with 
challenging questions. This is an attitude, we are told, he shared with his Ph.D. advisor, 
the late Prof. Debabrata Basu. 

Just as an example, one of Carlinhos' favorite questions is: Why do we (ever) randomize?
I hope that some of the ideas presented in chapter 3 can contribute to the discussion
of this fundamental issue. Carlinhos' extensive consulting practice for the medical
community makes him (sometimes, painfully) aware of the need to temper randomization
procedures with sophisticated protocols that take into account the patients' need to
receive proper care.

This work has its focus on epistemological aspects. The topics under discussion are, 
however, surprisingly close to, and have many times been directly motivated by, our 
consulting practice in statistical modeling and operations research. The very definition of 
the FBST was originally inspired by some juridical consulting projects, see Stern (2003). 
This does not mean that many of these interrelated issues tend to be ignored in everyday 
practice, like the proverbial bird that ignores the air which supports it, or the fish that 
ignores the water in which it swims. 

The author can be reached at jmstern@hotmail.com . 



Julio Michael Stern 
Sao Paulo, 20/12/2007. 
Version Control

- Version 1.0 - December 20, 2007. 

- Version 1.1 - April 9, 2008. Several minor corrections to the main text and some biblio- 
graphic updates. The appendices have been reorganized as follows: Appendix A presents 
a short review of the FBST, including its definition and main statistical and logical proper- 
ties; Appendix B fully reviews the distribution theory used to build Multinomial-Dirichlet 
statistical models; Appendix C summarizes several statistical models used to illustrate 
the core lectures; Appendix D (previously a separate handout) gives a short introduction 
to deterministic optimization; Appendix E reviews some important concepts related to 
the Maximum Entropy formalism and asymptotic convergence; Appendix F, on sparse 
factorizations, provides some technical details related to the discussions on decoupling 
procedures in chapter 3; Appendix G presents a technical miscellanea on Monte Carlo 
Methods; Appendix H provides a short derivation of some stochastic optimization algo- 
rithms and evolution models; Appendix I lists some open research programs; Appendix J 
contains all bitmap figures and, finally, Appendix K collects pieces of hard to
get reading material. They will be posted at my web page, subject to the censorship of
our network administrator and his understanding of Brazilian copyright laws and regula- 
tions. All computer code was removed from text and is now available at my web page, 
www.ime.usp.br/~jstern . 

This version has been used for a tutorial at MaxEnt-2008, the 28th International Work- 
shop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, 
held on July 6-11, at Boraceia, Sao Paulo, Brazil. 

- Version 1.2 - December 10, 2008. Minor corrections to the main text and appendices, 
and some bibliographic updates. New section F.1 on dense matrix factorizations. This
section also defines the matrix notation now used consistently throughout the book. 

- Version 2.0 - December 19, 2009. New section 4.5 and chapter 6, presented at the con- 
ference MBR'09 - Model Based Reasoning in Science and Technology - held at Campinas, 
Brazil. Most of the figures at exhibition in the art gallery are now in the separate file, 
www.ime.usp.br/~jstern/pub/gallery2.pdf .

- Version 2.2 - July 19, 2011. Minor corrections. 



Chapter 1 

Eigen-Solutions and Sharp Statistical 
Hypotheses 

"Eigenvalues have been found ontologically to be 
discrete, stable, separable and composable ..." 

Heinz von Foerster (1911 - 2002), 
Objects: Tokens for Eigen-Behaviours.



1.1 Introduction 

In this chapter, a few epistemological, ontological and sociological questions concerning 
the statistical significance of sharp hypotheses in the scientific context are investigated 
within the framework provided by Cognitive Constructivism, or the Constructivist Theory 
(ConsTh) as presented in Maturana and Varela (1980), Foerster (2003) and Luhmann 
(1989, 1990, 1995). Several conclusions of the study, however, remain valid, mutatis 
mutandis, within various other organizations and systems, see for example Bakken and 
Hernes (2002), Christis (2001), Mingers (2000) and Rasch (1998).

The author's interest in this research topic emerged from his involvement in the de- 
velopment of the Full Bayesian Significance Test (FBST), a novel Bayesian solution to 
the statistical problem of measuring the support of sharp hypotheses, first presented in 
Pereira and Stern (1999). The problem of measuring the support of sharp hypotheses 
poses several conceptual and methodological difficulties for traditional statistical analysis 
under both the frequentist (classical) and the orthodox Bayesian approaches. The solution 
provided by the FBST has significant advantages over traditional alternatives, in terms of 
its statistical and logical properties. Since these properties have already been thoroughly 
analyzed in previous papers, see references, the focus herein is directed exclusively to 
epistemological and ontological questions. 



Despite the fact that the FBST is fully compatible with Decision Theory (DecTh), 
as shown in Madruga et al. (2001), which, in turn, provides a strong and coherent 
epistemological framework to orthodox Bayesian Statistics, its logical properties open 
the possibility of using and benefiting from alternative epistemological settings. In this 
chapter, the epistemological framework of ConsTh is counterposed to that of DecTh. The 
contrast, however, is limited in scope by our interest in statistics and is carried out in a 
rather exploratory and non-exhaustive form. The epistemological framework of ConsTh is
also counterposed to that of Falsificationism, the epistemological framework within which
classical frequentist statistical tests of hypotheses are often presented, as shown in Boyd
(1991) and Popper (1959, 1963). 

In section 2, the fundamental notions of Autopoiesis and Eigen-Solutions in autopoietic 
systems are reviewed. In section 3, the same is done with the notions of Social Systems 
and Functional Differentiation and in section 4, a ConsTh view of science is presented. 
In section 5, the material presented in sections 2, 3 and 4 is related to the statistical 
significance of sharp scientific hypotheses and the findings therein are counterposed to 
traditional interpretations such as those of DecTh. In section 6, a few sociological analyses 
for differentiation phenomena are reviewed. In sections 7 and 8, the final conclusions are 
established.

In sections 2, 3, 4, and 6, well established concepts of the ConsTh are presented. 
However, in order to overcome an unfortunately common scenario, an attempt is made 
to make them accessible to a scientist or statistician who is somewhat familiar with 
traditional frequentist, and decision-theoretic statistical interpretations, but unfamiliar 
with the constructivist approach to epistemology. Rephrasing these concepts (once again) 
is also avoided. Instead, quoting the primary sources is preferred whenever it can be clearly 
(in our context) and synthetically done. The contributions in sections 5, 7 and 8, relate 
mostly to the analysis of the role of quantitative methods specifically designed to measure 
the statistical support of sharp hypotheses. A short review of the FBST is presented in 
Appendix A. 



1.2 Autopoiesis and Eigen-Solutions 

The concept of autopoiesis tries to capture an essential characteristic of living organisms 
(auto=self, poiesis=production). Its purpose and definition are stated in Maturana and 
Varela (1980, p.84 and 78-79): 

"Our aim was to propose the characterization of living systems that explains 
the generation of all the phenomena proper to them. We have done this by 
pointing at Autopoiesis in the physical space as a necessary and sufficient 
condition for a system to be a living one. " 
"An autopoietic system, is organized (defined as a unity) as a network of
processes of production (transformation and destruction) of components that
produces the components which:

(i) through their interactions and transformations continuously regenerate and
realize the network of processes (relations) that produced them; and

(ii) constitute it (the machine) as a concrete unity in the space in which they
(the components) exist by specifying the topological domain of its realization
as such a network."

Autopoietic systems are non-equilibrium (dissipative) dynamical systems exhibiting
(meta) stable structures, whose organization remains invariant over (long periods of) 
time, despite the frequent substitution of their components. Moreover, these components 
are produced by the same structures they regenerate. For example, the macromolecular 
population of a single cell can be renewed thousands of times during its lifetime, see 
Bertalanffy (1969). The investigation of these regeneration processes in the autopoietic 
system production network leads to the definition of cognitive domain, Maturana and 
Varela (1980, p.10):

"The circularity of their organization continuously brings them back to the 
same internal state (same with respect to the cyclic process). Each internal
state requires that certain conditions (interactions with the environment) be
satisfied in order to proceed to the next state. Thus the circular organization
implies the prediction that an interaction that took place once will take place
again. If this does not happen the system maintains its integrity (identity with
respect to the observer) and enters into a new prediction. In a continuously
changing environment these predictions can only be successful if the
environment does not change in that which is predicted. Accordingly, the predictions
implied in the organization of the living system are not predictions of
particular events, but of classes of interactions. Every interaction is a particular
interaction, but every prediction is a prediction of a class of interactions that
is defined by those features of its elements that will allow the living system
to retain its circular organization after the interaction, and thus, to interact
again. This makes living systems inferential systems, and their domain of
interactions a cognitive domain."

The characteristics of these circular (cyclic or recursive) regenerative processes and their
eigen (auto, equilibrium, fixed, homeostatic, invariant, recurrent, recursive) -states, both 
in concrete and abstract autopoietic systems, are further investigated in Foerster (2003) 
and Segal (2001): 

"The meaning of recursion is to run through one's own path again. One of its
results is that under certain conditions there exist indeed solutions which, when
reentered into the formalism, produce again the same solution. These are called
"eigen-values", "eigen-functions", "eigen-behaviors", etc., depending on which
domain this formation is applied - in the domain of numbers, in functions, in
behaviors, etc." Segal (2001, p.145).

The concept of eigen-solution for an autopoietic system is the key to distinguish specific 
objects in a cognitive domain. Von Foerster also establishes four essential attributes of
eigen-solutions that will support the analyses conducted in this chapter and the conclusions
established herein.

"Objects are tokens for eigen-behaviors. Tokens stand for something else. In 
exchange for money (a token itself for gold held by one's government, but
unfortunately no longer redeemable), tokens are used to gain admittance to
the subway or to play pinball machines. In the cognitive realm, objects are the
token names we give to our eigen-behavior.

This is the constructivist's insight into what takes place when we talk about
our experience with objects." Segal (2001, p. 127).

"Eigenvalues have been found ontologically to be discrete, stable, separable and 
composable, while ontogenetically to arise as equilibria that determine
themselves through circular processes. Ontologically, Eigenvalues and objects, and
likewise, ontogenetically, stable behavior and the manifestation of a subject's
"grasp" of an object cannot be distinguished." Foerster (2003, p. 266).

The arguments used in this study rely heavily on two qualitative properties of
eigen-solutions, referred to by von Foerster with the terms "Discrete" and "Equilibria". In what
follows, the meaning of these qualifiers, as they are understood by von Foerster and used
herein, is examined:

a- Discrete (or sharp):

"There is an additional point I want to make, an important point. Out of an
infinite continuum of possibilities, recursive operations carve out a precise set
of discrete solutions. Eigen-behavior generates discrete, identifiable entities.
Producing discreteness out of infinite variety has incredibly important
consequences. It permits us to begin naming things. Language is the possibility
of carving out of an infinite number of possible experiences those experiences
which allow stable interactions of yourself with yourself." Segal (2001, p. 128).






It is important to realize that, in the sequel, the term "discrete", used by von Foerster
to qualify eigen-solutions in general, should be replaced, depending on the specific context,
by terms such as lower-dimensional, precise, sharp, singular etc. Even in the familiar case
of linear algebra, if we define the eigen-vectors corresponding to a singular eigen-value
$c$ of a linear transformation $T(\,)$ only by their essential property of directional invariance,
$T(x) = cx$, we obtain one dimensional sub-manifolds which, in this case, are subspaces
or lines through the origin. Only if we add the usual (but non essential) normalization
condition, $\|x\| = 1$, do we get discrete eigen-vectors.
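
As a small numerical illustration of this point, the following Python sketch checks the invariance property T(x) = cx along a whole line through the origin, and then applies the normalization ||x|| = 1. The 2x2 matrix, the helper names `apply` and `is_eigenvector`, and the tolerance are illustrative assumptions, not taken from the text.

```python
import math

def apply(T, x):
    """Apply the linear map T (a list of rows) to the vector x."""
    return [sum(t_ij * x_j for t_ij, x_j in zip(row, x)) for row in T]

def is_eigenvector(T, x, c, tol=1e-12):
    """Check the directional invariance property T(x) = c x, componentwise."""
    return all(abs(tx_i - c * x_i) <= tol for tx_i, x_i in zip(apply(T, x), x))

# Hypothetical example matrix (an assumption for illustration only).
# Its eigen-values are 3 and 1, and (1, 1) is an eigen-vector for c = 3.
T = [[2.0, 1.0], [1.0, 2.0]]
c = 3.0
x = [1.0, 1.0]

# Every non-zero multiple s*x, i.e. every such point on the line through
# the origin spanned by x, satisfies the invariance equation as well.
on_line = [[s * x_i for x_i in x] for s in (-2.0, 0.5, 1.0, 7.0)]
all_invariant = all(is_eigenvector(T, v, c) for v in on_line)

# Adding the normalization ||x|| = 1 leaves only isolated points of that line.
norm = math.sqrt(sum(x_i * x_i for x_i in x))
unit = [x_i / norm for x_i in x]
unit_solutions = [unit, [-x_i for x_i in unit]]
```

In this sketch the invariance equation alone is satisfied by a continuum of vectors, while the normalized solutions form a finite set, which is the sense in which the eigen-vectors become discrete.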




b- Equilibria (or stable):

A stable eigen-solution of the operator $Op(\,)$, defined by the fixed-point or invariance
equation, $x_{inv} = Op(x_{inv})$, can be found, built or computed as the limit, $x_{\infty}$, of the
sequence defined by recursive application of the operator, $x_{n+1} = Op(x_n)$. Under
appropriate conditions, such as within a domain of attraction, the convergence of the process
and its limit eigen-solution will not depend on the starting point, $x_0$. In the linear algebra
example, using almost any starting point, the sequence generated by the recursive relation
$x_{n+1} = T(x_n)/\|T(x_n)\|$, i.e. the application of $T$ followed by normalization, converges to
the unitary eigen-vector corresponding to the largest eigen-value.
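
The recursion described above can be tried out with a minimal Python sketch. It is only an illustration under stated assumptions: the helper names `apply`, `normalize` and `iterate`, the 2x2 matrix, the cosine example, the starting points and the fixed number of steps are all hypothetical choices, not part of the text.

```python
import math

def apply(T, x):
    """Apply the linear map T (a list of rows) to the vector x."""
    return [sum(t_ij * x_j for t_ij, x_j in zip(row, x)) for row in T]

def normalize(x):
    """Rescale x to unit Euclidean norm."""
    norm = math.sqrt(sum(x_i * x_i for x_i in x))
    return [x_i / norm for x_i in x]

def iterate(op, x0, steps=200):
    """Return x_steps for the recursion x_{n+1} = op(x_n), starting at x0."""
    x = x0
    for _ in range(steps):
        x = op(x)
    return x

# Scalar example (an assumption): the fixed point x_inv = cos(x_inv),
# reached from two different starting points.
fixed_a = iterate(math.cos, 0.0)
fixed_b = iterate(math.cos, 2.5)

# Linear algebra example: application of T followed by normalization.
# Hypothetical matrix with eigen-values 3 and 1.
T = [[2.0, 1.0], [1.0, 2.0]]

def step(x):
    return normalize(apply(T, x))

limit_a = iterate(step, [1.0, 0.0])
limit_b = iterate(step, [0.3, -0.2])

# An exceptional starting point: (1, -1) is itself an eigen-vector for the
# smaller eigen-value, so the recursion never leaves its direction.
stuck = iterate(step, [1.0, -1.0])

# Rayleigh quotient of the unit-norm limit, an estimate of its eigen-value.
eigen_value = sum(a * b for a, b in zip(limit_a, apply(T, limit_a)))
```

Both ordinary starting points lead to the same limit in each example, while the starting point (1, -1) shows why the statement holds only for "almost any" starting point.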

In sections 4 and 5 it is shown, for statistical analysis in a scientific context, how the 
property of sharpness indicates that many, and perhaps some of the most relevant, scien- 
tific hypotheses are sharp, and how the property of stability, indicates that considering 
these hypotheses is natural and reasonable. The statistical consequences of these findings 
will be discussed in sections 7 and 8. Before that, however, a few other ConsTh concepts 
must be introduced in sections 3 and 6. 

Autopoiesis found its name in the work of Maturana and Varela (1980), together with 
a simple, powerful and elegant formulation using the modern language of systems theory.
Nevertheless, some of the basic theoretical concepts, such as those of self-organization and 
autonomy of living organisms, have long historical grounds that some authors trace back 
to Kant. As seen in Kant (1790, sec. 65) for example, a (self-organized) "Organism" is 
characterized as an entity in which, 

"... every part is thought as 'owing' its presence to the 'agency' of all the 
remaining parts, and also as existing 'for the sake of the others' and of the 
whole, that is as an instrument, or organ. " 

"Its parts must in their collective unity reciprocally produce one another alike 
as to form and combination, and thus by their own causality produce a whole, 
the conception of which, conversely, -in a being possessing the causality ac- 
cording to conceptions that is adequate for such a product- could in turn be the 
cause of the whole according to a principle, so that, consequently, the nexus 
of 'efficient causes' (progressive causation, nexus effectivus) might be no less 
estimated as an 'operation brought about by final causes' (regressive causation,
nexus finalis). " 

For further historical comments we refer the reader to Zeleny (1980).

1.3 Functional Differentiation 

In order to give appropriate answers to environmental complexities, autopoietic systems 
can be hierarchically organized as Higher Order Autopoietic Systems. As in Maturana 
and Varela (1980, p. 107,109), this notion is defined via the concept of Coupling: 

"Whenever the conduct of two or more units is such that there is a domain in 
which the conduct of each one is a function of the conduct of the others, it is 
said that they are coupled in that domain. " 

"An autopoietic system whose autopoiesis entails the autopoiesis of the coupled 
autopoietic units which realize it, is an autopoietic system of higher order. " 

A typical example of a hierarchical system is a Beehive, a third order autopoietic 
system, formed by the coupling of individual Bees, the second order systems, which, in 
turn, are formed by the coupling of individual Cells, the first order systems. 

The philosopher and sociologist Niklas Luhmann applied this notion to the study 
of modern human societies and their systems. Luhmann's basic abstraction is to look at 
a social system only at its highest hierarchical level, in which it is seen as an autopoietic 
communications network. In Luhmann's terminology, a communication event consists of: 
Utterance, the form of transmission; Information, the specific content; and Understanding, 
the relation to future events in the network, such as the activation or suppression of future 
communications. 

"Social systems use communication as their particular mode of autopoietic 
(re)production. Their elements are communications that are recursively pro- 
duced and reproduced by a network of communications that are not living units, 
they are not conscious units, they are not actions. Their unity requires a syn- 
thesis of three selections, namely information, utterance and understanding 
(including misunderstanding)." Luhmann (1990b, p. 3). 

For Luhmann, society's best strategy to deal with increasing complexity is the same as 
one observes in most biological organisms, namely, differentiation. Biological organisms 
differentiate in specialized systems, such as organs and tissues of a pluricellular life form 
(non-autopoietic or allopoietic systems), or specialized individuals in an insect colony 
(autopoietic system). In fact, societies and organisms can be characterized by the way in 
which they differentiate into systems. For Luhmann, modern societies are characterized 
by a vertical differentiation into autopoietic functional systems, where each system is 
characterized by its code, program and (generalized) media. The code gives a bipolar 
reference to the system, of what is positive, accepted, favored or valid, versus what is 
negative, rejected, disfavored or invalid. The program gives a specific context where the 
code is applied, and the media is the space in which the system operates. 

Standard examples of social systems are: 

- Science: with a true/false code, working in a program set by a scientific theory, and 
having articles in journals and proceedings as its media; 

- Judicial: with a legal/illegal code, working in a program set by existing laws and 
regulations, and having certified legal documents as its media; 

- Religion: with a good/evil code, working in a program set by sacred and hermeneutic 
texts, and having study, prayer and good deeds as its media; 

- Economy: with a property/lack thereof code, working in a program set by economic 
planning scenarios and pricing methods, and having money and money-like financial assets 
as its media. 

Before ending this section, a notion related to the break-down of autopoiesis is introduced: 
Dedifferentiation (Entdifferenzierung) is the degradation of the system's internal 
coherence, through adulteration, disruption, or dissolution of its own autopoietic relations. 
One form of dedifferentiation (in either biological or social systems) is the system's 
penetration by external agents who try to use the system's resources in a way that is not 
compatible with the system's autonomy. In Luhmann's conception of modern society each 
system may be aware of events in other systems, that is, be cognitively open, but is 
required to maintain its differentiation, that is, be operationally closed. In Luhmann's 
(1989, p. 109) words: 



"With functional differentiation... Extreme elasticity is purchased at the cost 
of the peculiar rigidity of its contextual conditions. Every binary code claims 
universal validity, but only for its own perspective. Everything, for example, 
can be either true or false, but only true or false according to the specific 
theoretical programs of the scientific system. Above all, this means that no 
function system can step in for any other. None can replace or even relieve any 
other. Politics can not be substituted for economy, nor economy for science, 
nor science for law or religion, nor religion for politics, etc., in any conceivable 
intersystem relations." 

1.4 Eigensolutions and Scientific Hypotheses 

The interpretation of scientific knowledge as an eigensolution of a research process is part 
of a constructive approach to epistemology. Figure 1 presents an idealized structure and 
dynamics of knowledge production. This diagram represents, on the Experiment side (left 
column) the laboratory or field operations of an empirical science, where experiments are 
designed and built, observable effects are generated and measured, and the experimental 
data bank is assembled. On the Theory side (right column), the diagram represents the 
theoretical work of statistical analysis, interpretation and (hopefully) understanding ac- 
cording to accepted patterns. If necessary, new hypotheses (including whole new theories) 
are formulated, motivating the design of new experiments. Theory and experiment con- 
stitute a double feed-back cycle making it clear that the design of experiments is guided 
by the existing theory and its interpretation, which, in turn, must be constantly checked, 
adapted or modified in order to cope with the observed experiments. The whole system 
constitutes an autopoietic unit, as seen in Krohn and Kuppers (1990, p. 214): 

"The idea of knowledge as an eigensolution of an operationally closed combina- 
tion between argumentative and experimental activities attempts to answer the 
initially posed question of how the construction of knowledge binds itself to its 
construction in a new way. The coherence of an eigensolution does not refer 
to an objectively given reality but follows from the operational closure of the 
construction. Still, different decisions on the selection of couplings may lead 
to different, equally valid eigensolutions. Between such different solutions no 
reasonable choice is possible unless a new operation of knowledge is constructed 
exactly upon the differences of the given solutions. But again, this frame of 
reference for explicitly relating different solutions to each other introduces new 
choices with respect to the coupling of operations and explanations. It does 
not reduce but enhances the dependence of knowledge on decisions. On the 
other hand, the internal restrictions imposed by each of the chosen couplings 
do not allow for any arbitrary construction of results. Only few are suitable 
to mutually serve as inputs in a circular operation of knowledge. " 

1.5 Sharp Statistical Hypotheses 

Statistical science is concerned with inference and application of probabilistic models. 
From what has been presented in the preceding sections, it becomes clear what the role 
of Statistics in scientific research is, at least in the ConsTh view of scientific research: 
Statistics has a dual task, to be performed both in the Theory and the Experiment sides 
of the diagram in Figure 1: 

              Experiment                                       Theory 

  Operationalization    <=    Experiment design      <=    Hypotheses formulation 
          |                                                          ^ 
          v                                                          | 
  Effects observation         True/False eigen-solution    Creative interpretation 
          |                                                          ^ 
          v                                                          | 
  Data acquisition      =>    Mnemetic explanation   =>    Statistical analysis 

              Sample space                                 Parameter space 

Figure 1: Scientific production diagram. 

- At the Experiment side of the diagram, the task of statistics is to make probabilistic 
statements about the occurrence of pertinent events, i.e. describe probabilistic distribu- 
tions for what, where, when or which events can occur. If the events are to occur in the 
future, these descriptions are called predictions, as is often the case in the natural sci- 
ences. It is also possible (more often in social sciences) to deal with observations related 
to past events, that may or may not be experimentally generated or repeated, imposing 
limitations on the quantity and/or quality of the available data. Even so, the habit of 
calling this type of statement "predictive probabilities" will be maintained. 

- At the Theory side of the diagram, the role of statistics is to measure the statistical 
support of hypotheses, i.e. to measure, quantitatively, the hypotheses' plausibility or 
possibility in the theoretical framework where they were formulated, given the observed 
data. From the material presented in the preceding sections, it is also clear that, in 
this role, statistics is primarily concerned with measuring the statistical support of sharp 
hypotheses, since sharpness (precision or discreteness) is an essential attribute 
of eigen-solutions. 

Let us now examine how well the traditional statistical paradigms, and in contrast the 
FBST, are able to take care of this dual task. In order to examine this question, the first 
step is to distinguish what kind of probabilistic statements can be made. We make use of 
three statement categories: Frequentist, Epistemic and Bayesian: 

Frequentist probabilistic statements are made exclusively on the basis of the frequency 
of occurrence of an event in a (potentially) infinite sequence of observations generated by 
a random variable. 

Epistemic probabilistic statements are made on the basis of the epistemic status (de- 
gree of belief, likelihood, truthfulness, validity) of an event from the possible outcomes 
generated by a random variable. This generation may be actual or potential, that is, may 
have been realized or not, may be observable or not, may be repeated an infinite or finite 
number of times. 

Bayesian probabilistic statements are epistemic probabilistic statements generated by 
the (in practice, always finite) recursive use of Bayes formula: 

$$ p_n(\theta) \propto p_{n-1}(\theta) \, p(x_n \mid \theta) . $$

In standard models, the parameter $\theta$, a non observed random variable, and the sample 
$x$, an observed random variable, are related through their joint probability distribution, 
$p(x, \theta)$. The prior distribution, $p_0(\theta)$, is the starting point for the Bayesian recursion 
operation. It represents the initial available information about $\theta$. In particular, the prior 
may represent no available information, like distributions obtained via the maximum 
entropy principle, see Dugdale (1996) and Kapur (1989). The posterior distribution, $p_n(\theta)$, 
represents the available information on the parameter after the n-th "learning step", in 
which Bayes formula is used to incorporate the information carried by observation $x_n$. 
Because of the recursive nature of the procedure, the posterior distribution in a given 
step is used as prior in the next step. 
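
The recursion can be made concrete with a small numerical sketch. Everything specific in it is an assumption introduced for illustration and is not taken from the text: the Bernoulli sampling model, the uniform prior on a discrete grid of parameter values, the toy data, and the names `bayes_step` and `posterior_after`. Each call to `bayes_step` performs one learning step, and its output is reused as the prior of the next step.

```python
def bayes_step(prior, grid, x):
    """One learning step: p_n(theta) is proportional to p_{n-1}(theta) * p(x_n | theta)."""
    # Assumed Bernoulli likelihood: p(x | theta) = theta if x == 1, else 1 - theta.
    unnormalized = [p * (t if x == 1 else 1.0 - t) for p, t in zip(prior, grid)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def posterior_after(data, grid):
    """Run the recursion over all observations, starting from a uniform prior p_0."""
    posterior = [1.0 / len(grid)] * len(grid)
    for x in data:
        # The posterior of the previous step is the prior of this step.
        posterior = bayes_step(posterior, grid, x)
    return posterior


# Hypothetical setup: 1000 grid midpoints in (0, 1) and eight binary observations.
GRID = [(i + 0.5) / 1000 for i in range(1000)]
DATA = [1, 1, 0, 1, 1, 0, 1, 1]

posterior = posterior_after(DATA, GRID)
posterior_mean = sum(t * p for t, p in zip(GRID, posterior))
print(f"posterior mean of theta: {posterior_mean:.4f}")
```

Because each step only multiplies by a likelihood term and renormalizes, the final posterior in this sketch does not depend on the order in which the observations are processed.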

Frequentist statistics dogmatically demands that all probabilistic statements be frequentist. 
Therefore, any direct probabilistic statement on the parameter space is categorically 
forbidden. Scientific hypotheses are epistemic statements about the parameters 
of a statistical model. Hence, frequentist statistics can not make any direct statement 
about the statistical significance (truthfulness) of hypotheses. Strictly speaking it can 
only make statements at the Experiment side of the diagram. The frequentist way of 
dealing with questions on the Theory side of the diagram is to embed them somehow into 
the Experiment side. One way of doing this is by using a construction in which the whole 
data acquisition process is viewed as a single outcome of an imaginary infinite meta ran- 
dom process, and then make a frequentist statement, on the meta process, about the 
frequency of unsatisfactory outcomes of some incompatibility measure of the observed 
data bank with the hypothesis. This is the classic (and often forgotten) rationale used 
when stating a p-value. So we should always speak of the p-value of the data bank (not 
of the hypothesis). The resulting conceptual confusion and frustration (for most working 
scientists) with this kind of convoluted reasoning is captured by a wonderful parody of 
Galileo's dialogues in Rouanet et al. (1998). 

A p-value is the probability, computed under the hypothesis, of getting a sample that is 
at least as extreme as the one we got. We should therefore specify which criterion is used 
to define what we mean by more extreme, i.e., how we order the sample space. Usually there 
are several possible criteria to do that; for examples, see Pereira and Wechsler (1993). 
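
How much the ordering criterion matters can be seen in a minimal sketch. The binomial model, the numbers, and the two orderings below (distance from the expected count, and outcome probability under the hypothesis) are assumptions chosen for illustration; they are not the constructions of Pereira and Wechsler (1993). The observed sample is counted as extreme in both cases.

```python
from math import comb


def binom_pmf(k, n, theta):
    """Probability of k successes in n Bernoulli(theta) trials."""
    return comb(n, k) * theta**k * (1.0 - theta) ** (n - k)


def p_value(x_obs, n, theta0, at_least_as_extreme):
    """Sum the null probabilities of all outcomes ranked at least as extreme as x_obs."""
    return sum(
        binom_pmf(k, n, theta0)
        for k in range(n + 1)
        if at_least_as_extreme(k, x_obs, n, theta0)
    )


def by_distance(k, x_obs, n, theta0):
    """Ordering 1: outcomes at least as far from the expected count n * theta0."""
    mean = n * theta0
    return abs(k - mean) >= abs(x_obs - mean) - 1e-12


def by_probability(k, x_obs, n, theta0):
    """Ordering 2: outcomes at most as probable as x_obs under the hypothesis."""
    return binom_pmf(k, n, theta0) <= binom_pmf(x_obs, n, theta0) * (1.0 + 1e-12)


# Hypothetical data: 5 successes in 10 trials, sharp hypothesis theta = 0.3.
N, THETA0, X_OBS = 10, 0.3, 5
p_dist = p_value(X_OBS, N, THETA0, by_distance)
p_prob = p_value(X_OBS, N, THETA0, by_probability)
print(f"ordering by distance:    p = {p_dist:.4f}")
print(f"ordering by probability: p = {p_prob:.4f}")
```

The same sample and the same hypothesis yield two different p-values, because the outcome with one success is "extreme" under the first ordering but not under the second.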

Figure 2: Independence Hypothesis, n=16. (Surviving panel titles: Post. Prob., NPW p-value; horizontal axis from -50 to 50.) 

Figure 2 compares four statistics, namely, orthodox Bayesian posterior probabilities, 
Neyman-Pearson-Wald (NPW) p-values, Chi-square approximate p-values, and the FBST 
evidence value in favor of H. In this example H is the independence hypothesis in a 2 x 2 
contingency table, for sample size n = 16, see sections A1 and B1. The horizontal axis 
shows the "diagonal asymmetry" statistic, D (the difference between the diagonal products). 
The statistic D is an estimator of an unnormalized version of Pearson's correlation 
coefficient, $\rho$. For detailed explanations, see Irony et al. (1995, 2000), Stern and Zacks (2002) 
and Madruga, Pereira and Stern (2003). 

$$ D = x_{1,1} \, x_{2,2} - x_{1,2} \, x_{2,1} $$



Samples that are "perfectly compatible with the hypothesis", that is, having no asymmetry, 
are near the center of the plot, with increasingly incompatible samples to the sides. 
The envelope curve for the resulting FBST e-values, to be commented later in this section, 
is smooth and therefore level at its maximum, where it reaches the value 1. 

In contrast, the envelope curves for the p-values take the form of a cusp, i.e. a pointed 
curve, that is broken (non differentiable) at its maximum, where it also reaches the value 
one. The acuteness of the cusp also increases with increasing sample size. In the case 
of NPW p-values we see, at the top of the cusp, a "ladder" or "spike", with several 
samples with no asymmetry, but having different outcome probabilities, "competing" for 
the higher p-value. 

This is a typical collateral effect of the artifice that converts a question about the 
significance of H, asking for a probability in the parameter space as an answer, into a 
question, conditional on H being true, about the outcome probability of the observed 
sample, offering a probability in the sample space as an answer. This qualitative analysis 
of the p-value methodology gives us an insight into typical abuses of the expression "increase 
sample size to reject". In the words of I.J. Good (1983, p. 135): 

"Very often the statistician doesn't bother to make it quite clear whether his 
null hypothesis is intended to be sharp or only approximately sharp.... 

It is hardly surprising then that many Fisherians (and Popperians) say that 
- you can't get (much) evidence in favor of the null hypothesis but can only 
refute it." 

In Bayesian statistics we are allowed to make probabilistic statements on the parameter 
space, and also, of course, in the sample space. Thus it seems that Bayesian statistics is the 
right tool for the job, and so it is! Nevertheless, we must first examine the role played by 
DecTh in orthodox Bayesian statistics. Since the pioneering work of de Finetti, Savage and 
many others, orthodox Bayesian Statistics has developed strong and coherent foundations 
grounded on DecTh, where many basic questions could be successfully analyzed and 
solved. 

These foundations can be stratified in two layers: 

- In the first layer, DecTh provides a coherence system for the use of probability statements, 
in the sense of de Finetti (1974, 1981). In this context, the FBST use of probability 
theory is fully compatible with DecTh, as shown in Madruga et al. (2001). 

- In the second layer, DecTh provides an epistemological framework for the interpre- 
tation of statistical procedures. The FBST logical properties open the possibility of using 
and benefiting from alternative epistemological settings such as ConsTh. Hence, DecTh 
does not have to be "the tool for all trades". 

We claim that, in the specific case of statistical procedures for measuring the support 
(significance tests) for sharp scientific hypotheses, ConsTh provides a more adequate 
epistemological framework than DecTh. This point is as important as it is subtle. In order 
to understand it let us first remember the orthodox paradigm, as it is concisely stated 
in Dubins and Savage (1965, 12.8, p. 229, 230). In a second quote, from Savage (1954, 
16.3, p. 254) we find that sharp hypotheses, even if important, make little sense in this 
paradigm, a position that is accepted throughout decision theoretic Bayesian statistics, 
as can also be seen in Levi (1974) and Maher et al. (1993). 

"Gambling problems in which the distributions of various quantities are promi- 
nent in the description of the gambler's fortune seem to embrace the whole of 
theoretical statistics according to one view (which might be called the decision- 
theoretic Bayesian view) of the subject. 

...From the point of view of decision-theoretic statistics, the gambler in this 
problem is a person who must ultimately act in one of two ways (the two 
guesses), one of which would be appropriate under one hypothesis ($H_0$) and 
the other under its negation ($H_1$). 

...Many problems, of which this one is an instance, are roughly of the following 
type. A person's opinion about unknown parameters is described by a proba- 
bility distribution; he is allowed successively to purchase bits of information 
about the parameters, at prices that may depend (perhaps randomly) upon the 
unknown parameters themselves, until he finally chooses a terminal action for 
which he receives an award that depends upon the action and parameters. " 

"I turn now to a different and, at least for me, delicate topic in connection with 
applications of the theory of testing. Much attention is given in the literature of 
statistics to what purport to be tests of hypotheses, in which the null hypothesis 
is such that it would not really be accepted by anyone. ... extreme (sharp) 
hypotheses, as I shall call them... 

...The unacceptability of extreme (sharp) null hypotheses is perfectly well known; 
it is closely related to the often heard maxim that science disproves, but never 
proves, hypotheses. The role of extreme (sharp) hypotheses in science and 
other statistical activities seems to be important but obscure. In particular, 
though I, like everyone who practices statistics, have often "tested" extreme 
(sharp) hypotheses, I cannot give a very satisfactory analysis of the process, 
nor say clearly how it is related to testing as defined in this chapter and other 
theoretical discussions. " 

As is clearly seen, in the DecTh framework we speak about the betting odds for 
"the hypothesis winning on a gamble taking place in the parameter space". But since 
sharp hypotheses are zero (Lebesgue) measure sets, our betting odds must be null, i.e. 
sharp hypotheses must be (almost surely) false. If we accept the ConsTh view that an 
important class of hypotheses concern the identification of eigen-solutions, and that those 
are ontologically sharp, we have a paradox! 
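
The null-odds argument can be written out in one line. As an illustrative assumption (not stated in the text), suppose the posterior has a density $p_n(\theta)$ with respect to Lebesgue measure $\lambda$ on the parameter space. Then, for any sharp hypothesis H, 

$$ \lambda(H) = 0 \;\Longrightarrow\; \Pr(\theta \in H \mid x) = \int_H p_n(\theta) \, d\theta = 0 , $$

so the posterior odds $\Pr(H \mid x) / (1 - \Pr(H \mid x))$ are zero for every possible sample, no matter how well the data agree with H. 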
From these considerations it is not surprising that frequentist and DecTh orthodoxy 
consider sharp hypotheses, at best, as anomalous crude approximations used when the 
scientist is incapable of correctly specifying error bounds, cost, loss or utility functions, 
etc., or else consider them to be "just plain silly". In the words of D. Williams (2002, 
p. 234): 

"Bayesian significance of sharp hypothesis: a plea for sanity: ...It astonishes 
me therefore that some Bayesians now assign non-zero prior probability that a 
sharp hypothesis is exactly true to obtain results which seem to support strongly 
null hypotheses which frequentists would very definitely reject. (Of course, it 
is blindingly obvious that such results must follow)." 

But no matter how many times statisticians reprimand scientists for their sloppiness 
and incompetence, they keep formulating sharp hypotheses, as if they were magnetically 
attracted to them. From the ConsTh plus FBST perspective they are, of course, just 
doing the right thing! 

Decision theoretic statistics has also developed methods to deal with sharp hypotheses, 
posting sometimes a scary caveat emptor for those willing to use them. The best known 
of such methods are Jeffreys' tests, based on Bayes Factors that assign a positive prior 
probability mass to the sharp hypothesis. This positive prior mass is supposed to work 
like a handicap system designed to balance the starting odds and make the game "fair". 
Out of that we only get new paradoxes, like the well documented Lindley's paradox. In 
opposition to its frequentist counterpart, this is an "increase sample size to accept" effect, 
see Shafer (1982). 
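
The "increase sample size to accept" effect can be illustrated numerically. The sketch below is not taken from the references; it assumes a textbook setting chosen for convenience: a normal mean with known standard deviation `sigma`, the sharp hypothesis theta = 0 with prior mass 1/2, and a N(0, tau^2) prior on theta under the alternative. The z-score of the sample mean is held fixed, so the two-sided p-value is the same for every sample size, while the Bayes factor in favor of the sharp hypothesis keeps growing.

```python
import math


def bayes_factor_01(z, n, tau=1.0, sigma=1.0):
    """Bayes factor for H0: theta = 0 against H1: theta ~ N(0, tau^2).

    Assumed model: the sample mean is N(theta, sigma^2 / n), and
    z = sample_mean / (sigma / sqrt(n)) is the observed z-score.
    """
    r = n * tau**2 / sigma**2
    return math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))


def posterior_prob_h0(z, n, prior_mass=0.5, tau=1.0, sigma=1.0):
    """Posterior probability of H0 when it starts with the given positive prior mass."""
    odds = bayes_factor_01(z, n, tau, sigma) * prior_mass / (1.0 - prior_mass)
    return odds / (1.0 + odds)


Z = 2.5  # fixed z-score, hence a fixed p-value
P_VALUE = math.erfc(Z / math.sqrt(2.0))  # two-sided normal p-value
SAMPLE_SIZES = [10, 100, 10_000, 1_000_000]
POSTERIORS = [posterior_prob_h0(Z, n) for n in SAMPLE_SIZES]

print(f"two-sided p-value for every n: {P_VALUE:.4f}")
for n, post in zip(SAMPLE_SIZES, POSTERIORS):
    print(f"n = {n:>9}: Pr(H0 | data) = {post:.3f}")
```

Under these assumptions a result that a frequentist test would reject at the 5% level ends up, for large enough samples, strongly supporting the sharp hypothesis.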

The FBST e-value or evidence value supporting the hypothesis, ev(H), was specially 
designed to effectively evaluate the support for a sharp hypothesis, H. This support 
function is based on the posterior probability measure of a set called the tangential set, 
T(H), which is a non zero measure set (so no null probability paradoxes), see Pereira and 
Stern (1999), Madruga et al. (2003) and subsection A1 of the appendix. 
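
A minimal numerical sketch may help fix ideas. It assumes the form of the definition commonly given for the FBST with a flat reference density: T(H) is the set of parameter points whose posterior density exceeds the supremum of the posterior density over H, and ev(H) = 1 - Pr(T(H) | x). The Beta posterior, the point hypothesis theta = 0.5, the grid integration, and the function names are illustrative choices, not the author's implementation.

```python
import math


def beta_pdf(theta, a, b):
    """Density of a Beta(a, b) distribution at theta in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(
        log_norm + (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)
    )


def e_value(theta0, a, b, grid=20_000):
    """Assumed definition: ev(H) = 1 - Pr(T | x), T = {theta : p(theta | x) > p(theta0 | x)}."""
    threshold = beta_pdf(theta0, a, b)  # supremum of the density over H = {theta0}
    h = 1.0 / grid
    tangential_mass = 0.0
    for i in range(grid):
        theta = (i + 0.5) * h  # midpoint rule on (0, 1)
        density = beta_pdf(theta, a, b)
        if density > threshold:
            tangential_mass += density * h
    return max(0.0, 1.0 - tangential_mass)


# Hypothetical data with a uniform prior: k successes in n trials give Beta(k + 1, n - k + 1).
EV_SMALL_SAMPLE = e_value(0.5, 8, 4)    # 7 successes in 10 trials
EV_LARGE_SAMPLE = e_value(0.5, 71, 31)  # 70 successes in 100 trials
EV_AT_MODE = e_value(0.7, 8, 4)         # hypothesis placed at the posterior mode

print(f"ev(theta = 0.5), n = 10:  {EV_SMALL_SAMPLE:.4f}")
print(f"ev(theta = 0.5), n = 100: {EV_LARGE_SAMPLE:.6f}")
print(f"ev(theta = 0.7), n = 10:  {EV_AT_MODE:.4f}")
```

The tangential set here is an interval of positive length, so its posterior probability is well defined without assigning any prior mass to the point theta = 0.5. When the hypothesis sits at the posterior mode the tangential set is empty and the e-value reaches 1.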

Although ev(H) is a probability in the parameter space, it is also a possibilistic support 
function. The word possibilistic carries a heavy load, implying that ev(H) complies 
with a very specific logic (or algebraic) structure, as seen in Darwiche and Ginsberg 
(1992), Stern (2003, 2004), and subsection A3 of the appendix. Furthermore the e-value 
has many necessary or desirable properties for a statistical support function, such as: 

1- Give an intuitive and simple measure of significance for the hypothesis in test, 
ideally, a probability defined directly in the original or natural parameter space. 

2- Have an intrinsically geometric definition, independent of any non-geometric aspect, 
like the particular parameterization of the (manifold representing the) null hypothesis 
being tested, or the particular coordinate system chosen for the parameter space, i.e., be 
an invariant procedure. 
3- Give a measure of significance that is smooth, i.e. continuous and differentiable, on 
the hypothesis parameters and sample statistics, under appropriate regularity conditions 
of the model. 

4- Obey the likelihood principle, i.e., the information gathered from observations 
should be represented by, and only by, the likelihood function. 

5- Require no ad hoc artifice like assigning a positive prior probability to zero measure 
sets, or setting an arbitrary initial belief ratio between hypotheses. 

6- Be a possibilistic support function. 

7- Be able to provide a consistent test for a given sharp hypothesis. 

8- Be able to provide compositionality operations in complex models. 

9- Be an exact procedure, not requiring "large sample" asymptotic approximations. 

10- Allow the incorporation of previous experience or expert's opinion via (subjective) 
prior distributions. 

For a careful and detailed explanation of the FBST definition, its computational implementation, 
statistical and logical properties, and several already developed applications, 
the reader is invited to consult some of the articles in the reference list. Appendix A 
provides a short review of the FBST, including its definition and main properties. 

1.6 Semantic Degradation 

In this section some constructivist analyses of dedifferentiation phenomena in social sys- 
tems are reviewed. If the conclusions in the last section are correct, it is surprising how 
many times DecTh, sometimes with a very narrow pseudo-economic interpretation, was 
misused in scientific statistical analysis. The difficulties of testing sharp hypotheses in 
the traditional statistical paradigms are well documented, and extensively discussed in 
the literature, see for example the articles in Harlow et al. (1997). We hope the material 
in this section can help us understand these difficulties as symptoms of problems with 
much deeper roots. The author is by no means the first to point out the danger of analyses 
carried out by blind transplantation of categories between heterogeneous systems. In 
particular, regarding the abuse of economic analyses, Luhmann (1989, p. 164) states: 

"In this sense, it is meaningless to speak of "non-economic" costs. This is only 
a metaphorical way of speaking that transfers the specificity of the economic 
mode of thinking indiscriminately to other social systems." 

For a sociological analysis of this phenomenon in the context of science, see for example 
Fuchs (1996, p.310) and DiMaggio and Powell (1991, p.63): 

"...higher-status sciences may, more or less aggressively, colonize lower-status 
fields in an attempt at reducing them to their own First Principles. For particle 
physics, all is quarks and the four forces. For neurophysiology, consciousness 
is the aggregate outcome of the behavior of neural networks. For sociobiol- 
ogy, philosophy is done by ants and rats with unusual large brains that utter 
metaphysical nonsense according to acquired reflexes. In short, successful and 
credible chains of reductionism usually move from the top to the bottom of 
disciplinary prestige hierarchies. " 

"This may explain the popularity of giving an "economical understanding" to 
processes in functionally distinct areas even if (or perhaps because) this se- 
mantics is often hidden by statistical theory and methods based on decision 
theoretic analysis. This also may explain why some areas, like ecology, sociology 
or psychology, are (or were) far more prone to suffer this kind of 
dedifferentiation by semantic degradation than others, like physics. " 

Once the forces pushing towards systemic degradation are clearly exposed, we hope 
one can understand the following corollary of von Foerster's famous ethical and aesthetical 
imperatives: 

- Theoretical imperative: Preserve systemic autopoiesis and semantic integrity, for 
dedifferentiation is in-sanity itself. 

- Operational imperative: Choose the right tool for each job: "If you only have a hammer, 
everything looks like a nail". 

1.7 Competing Sharp Hypotheses 

In this section we examine the concept of Competing Sharp Hypotheses. This concept has 
several variants, but the basic idea is that a good scientist should never test a single sharp 
hypothesis, for it would be an unfair fate for the poor sharp hypothesis to stand all alone 
against everything else in the world. Instead, a good scientist should always confront a 
sharp hypothesis with a competing sharp hypothesis, making the test a fair game. As 
seen in Good (1983, p.167,135,126): 

"Since I regard refutation and corroboration as both valid criteria for this demarcation 
it is convenient to use another term, checkability, to embrace both 
processes. I regard checkability as a measure to which a theory is scientific, 
where checking is to be taken in both its positive and negative senses, confirming 
and disconfirming." 

"...If by the truth of Newtonian mechanics we mean that it is approximately 
true in some appropriate well defined sense we could obtain strong evidence 
that it is true; but if we mean by its truth that it is exactly true then it has 
already been refuted." 

"...I think that the initial probability is positive for every self-consistent scientific 
theory with consequences verifiable in a probabilistic sense. No contradiction 
can be inferred from this assumption since the number of statable theories 
is at most countably infinite (enumerable)." 

"...It is very difficult to decide on numerical values for the probabilities, but it 
is not quite so difficult to judge the ratio of the subjective initial probabilities 
of two theories by comparing their complexities. This is one reason why the 
history of science is scientifically important. " 

The competing sharp hypotheses argument does not directly contradict the episte- 
mological framework presented in this chapter, and it may be appropriate under certain 
circumstances. It may also mitigate or partially remediate the paradoxes pointed out 
in the previous sections when testing sharp hypotheses in the traditional frequentist or 
orthodox Bayesian settings. However, the author does not believe that having competing 
sharp hypotheses is either a necessary condition for good scientific practice or an 
accurate description of the history of science. 

Just to stay with Good's example, let us quickly examine the very first major inci- 
dent in the tumultuous debacle of Newtonian mechanics. This incident was Michelson's 
experiment on the effect of "aethereal wind" over the speed of light, see Michelson and 
Morley (1887) and Lorentz et al. (1952). A clear and lively historical account of this 
experiment can be found in Jaffe (1960). Actually Michelson found no such effect, i.e. he 
found the speed of light to be constant, invariant with the relative speed of the observer. 
This result, a contradiction in Newtonian mechanics, is easily explained by Einstein's 
special theory of relativity. The fundamental difference between the two theories is their 
symmetry or invariance groups: Galileo's group for Newtonian mechanics, Lorentz' group 
for special relativity. A fundamental result of physics, Noether's Theorem, states that for 
every continuous symmetry in a physical theory, there must exist an invariant quantity 
or conservation law. For details the reader is referred to Byron and Fuller (1969, V-I, Sec. 
2.7), Doncel et al. (1987), Gruber et al. (1980-98), Houtappel et al. (1965), French 
(1968), Landau and Lifchitz (1966), Noether (1918), Wigner (1970), Weyl (1952). Con- 
servation laws are sharp hypotheses ideally suited for experimental checking. Hence, it 
seems that we are exactly in the situation of competing sharp hypotheses, and so we are 
today, from a far away historical perspective. But this is a post-mortem analysis of New- 
tonian mechanics. At the time of the experiment there was no competing theory. Instead 
of confirming an effect, specified only within an order of magnitude, Michelson found, to 
his and everybody else's astonishment, a null effect, up to the experiment's precision. 

Complex experiments like Michelson's require a careful analysis of experimental errors, 
identifying all significant sources of measurement noise and fluctuation. This kind of 
analysis is usual in experimental physics, and motivates a brief comment on a secondary 
source of criticism on the use of sharp hypotheses. In the past, one often had to work 
with oversimplified statistical models. This situation was usually imposed by limitations 
such as the lack of better or more realistic models, or the unavailability of the necessary 
numerical algorithms or the computer power to use them. Under these limitations, one 
often had to use minimalist statistical models or approximation techniques, even when 
these models or techniques were not recommended. These models or techniques were 
instrumental to provide feasible tools for statistical analysis, but made it very difficult to 
work (or proved very ineffective) with complex systems, scarce observations, very large 
data sets, etc. The need to work with complex models, and other difficult situations 
requiring the use of sophisticated statistical methods and techniques, is very common 
(and many times inescapable) in research areas dealing with complex systems like biology, 
medicine, social sciences, psychology, and many other fields, some of them distinguished 
with the mysterious appellation of "soft" science. A colleague once put it to me like this: 
"It seems that physics got all the easy problems...". 

If there is one area where the computational techniques of Bayesian statistics have
made dramatic contributions in the last decades, it is the analysis of complex models.
The development of advanced statistical computational techniques like Markov Chain
Monte Carlo (MCMC) methods, Bayesian and neural networks, random fields models,
and many others, makes us hope that most of the problems related to the use of
oversimplified models can now be overcome. Today good statistical practice requires all
statistically significant influences to be incorporated into the model, and one seldom finds
an acceptable excuse not to do so; see also Pereira and Stern (2001).



1.8 Final Remarks 



It should once more be stressed that most of the material presented in sections 2, 3, 
4, and 6 is not new in ConsTh. Unfortunately ConsTh has had a minor impact in 
statistics, and sometimes provoked a hostile reaction from the ill-informed. One possible 
explanation of this state of affairs may be found in the historical development of ConsTh. 
The constructivist reaction to a dogmatic realism prevalent in hard sciences, especially in
the XIX and the beginning of the XX century, raised a very outspoken rhetoric intended
to make explicitly clear how naive and fragile the foundations of this oversimplistic
realism were. This rhetoric was extremely successful, quickly awakening and forever
changing the minds of those directly interested in the fields of history and philosophy
of science, and spread rapidly into many other areas. Unfortunately the same rhetoric
could, in a superficial reading, make ConsTh be perceived as either hostile to or intrinsically
incompatible with the use of quantitative and statistical methods, or as leading to extreme
forms of subjectivism.

In ConsTh, or (objective) Idealism as presented in this chapter, neither does one claim
to have access to a "thing in itself" or "Ding an sich" in the external environment, see
Caygill (1995), as do dogmatic forms of realism, nor does one surrender to solipsism, as do
skeptical forms of subjectivism, including some representatives of the subjectivist school of
probability and statistics, as seen in Finetti (1974, 1.11, 7.5.7). In fact, it is the role of the
external constraints imposed by the environment, together with the internal autopoietic
relations of the system, to guide the convergence of the learning process to precise
eigen-solutions, these being, in the end, the ultimate or real objects of scientific knowledge. As
stated by Luhmann (1990a, 1995):

"... constructivism maintains nothing more than the unapproachability of the 
external world "in itself" and the closure of knowing - without yielding, at any 
rate, to the old skeptical or "solipsistic" doubt that an external world exists at 
all-..." Luhmann (1990a, p.65). 

"...at least in systems theory, they (statements) refer to the real world. Thus 
the concept of system refers to something that in reality is a system and thereby 
incurs the responsibility of testing its statements against reality." Luhmann
(1995, p.12).

"...both subjectivist and objectivist theories of knowledge have to be replaced by 
the system / environment distinction, which then makes the distinction subject 
/ object irrelevant." Luhmann (1990a, p. 66). 

The author hopes to have shown that ConsTh not only gives a balanced and effective 
view of the theoretical / experimental aspects of scientific research but also that it is well 
suited (or even better suited) to give the necessary epistemological foundations for the 
use of quantitative methods of statistical analysis needed in the practice of science. The
importance of measuring the statistical support for sharp hypotheses should also be
stressed, according to the author's interpretation of ConsTh. In this setting, the author believes
that, due to its statistical and logical characteristics, the FBST is the right tool for the
job, and hopes to have motivated the reader to find out more about the FBST definition,
theoretical properties, efficient computational implementation, and several of the already
developed applications, in some of the articles in the reference list. This perspective opens
interesting areas for further research. Among them, we mention the following two.

1.8.1 Noether and de Finetti Theorems 

The first area for further research has to do with some similarities between Noether
theorems in physics, and de Finetti type theorems in statistics. Noether theorems provide
invariant physical quantities or conservation laws from symmetry transformation groups
of the physical theory, and conservation laws are sharp hypotheses par excellence. In
a similar way, de Finetti type theorems provide invariant distributions from symmetry
transformation groups of the statistical model. Those invariant distributions can in turn
provide prototypical sharp hypotheses in many application areas. Physics has its own
heavy apparatus to deal with the all-important issues of invariance and symmetry.
Statistics, via de Finetti theorems, can provide such an apparatus for other areas, even in
situations that are not naturally embedded in a heavy mathematical formalism, see Feller
(1968, ch.7) and also Diaconis (1987, 1988), Eaton (1989), Nachbin (1965), Renyi (1970)
and Ressel (1987).

1.8.2 Compositionality 

The second area for further research has to do with one of the properties of eigen-solutions 
mentioned by von Foerster that has not been directly explored in this chapter, namely 
that eigen-solutions are "composable", see Borges and Stern (2005) and section A4.
Compositionality properties concern the relationship between the credibility, or truth value,
of a complex hypothesis, H, and those of its elementary constituents, H^j, j = 1 ... k.
Compositionality questions play a central role in analytical philosophy. 

According to Wittgenstein (2001, 2.0201, 5.0, 5.32): 

- Every complex statement can be analyzed from its elementary constituents. 

- Truth values of elementary statements are the results of those statements'
truth-functions (Wahrheitsfunktionen).

- All truth-functions are results of successive applications to elementary constituents
of a finite number of truth-operations (Wahrheitsoperationen).

Compositionality questions also play a central role in far more concrete contexts, like 
that of reliability engineering, see Birnbaum et al. (1961, 1.4): 

"One of the main purposes of a mathematical theory of reliability is to develop 
means by which one can evaluate the reliability of a structure when the relia- 
bility of its components are known. The present study will be concerned with 
this kind of mathematical development. It will be necessary for this purpose 
to rephrase our intuitive concepts of structure, component, reliability, etc. in 
more formal language, to restate carefully our assumptions, and to introduce 
an appropriate mathematical apparatus." 
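
As a minimal illustration of the kind of compositional rule described in this quotation, the sketch below computes the reliability of two textbook structures from the reliabilities of their components. It assumes statistically independent components, and the function names `series_reliability` and `parallel_reliability` are illustrative choices, not taken from the cited work.

```python
from math import prod


def series_reliability(component_reliabilities):
    """Reliability of a series structure: it works only if every component works.

    Assumption: components fail independently of each other.
    """
    return prod(component_reliabilities)


def parallel_reliability(component_reliabilities):
    """Reliability of a parallel structure: it works if at least one component works.

    Assumption: components fail independently of each other.
    """
    return 1.0 - prod(1.0 - p for p in component_reliabilities)


if __name__ == "__main__":
    components = [0.9, 0.8]
    print("series:  ", series_reliability(components))
    print("parallel:", parallel_reliability(components))
```

In both cases the reliability of the whole is a function of the reliabilities of the parts alone, which is the sense of compositionality at stake here.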

In Luhmann (1989, p. 79) we find the following remark on the evolution of science that
directly hints at the importance of this property:

"After the (science) system worked for several centuries under these
conditions it became clear where it was leading. This is something that idealization,
mathematization, abstraction, etc. do not describe adequately. It concerns the
increase in the capacity of decomposition and recombination, a new
formulation of knowledge as the product of analysis and synthesis. In this case analysis
is what is most important because the further decomposition of the visible world
into still further decomposable molecules and atoms, into genetic structures of
life or even into the sequence human/role/action/action-components as
elementary units of systems uncovers an enormous potential for recombination."

In the author's view, the composition (or re-combination) of scientific knowledge and
its use, so relevant in technology development and engineering, can give us a different
perspective (perhaps a bottom-up, as opposed to the top-down, perspective in this chapter)
on the importance of sharp hypotheses in science and technology practice. It can also
provide some insight into the valid forms of interaction of science with other social systems
or, in Luhmann's terminology, how science does (or should) "resonate" in human society.



Chapter 2

Language and the Self-Reference
Paradox



"If the string is too tight it will snap,
but if it is too loose it will not play."

Siddhartha Gautama. 

"The most beautiful thing we can experience is the mysterious. 

It is the source of all true art and all science. He to whom 
this emotion is a stranger, who can no longer pause to wonder 
and stand rapt in awe, is as good as dead: His eyes are closed. " 

Albert Einstein (1879 - 1955). 



2.1 Introduction 

In Chapter 1 it is shown how the eigen-solutions found in the practice of science are 
naturally represented by statistical sharp hypotheses. Statistical sharp hypotheses are 
routinely stated as natural "laws", conservation "principles" or invariant "transforms", 
and most often take the form of functional equations, like h(x) = c. Chapter 1 also
discusses why the eigen-solutions' essential attributes of discreteness (sharpness), stability,
and composability, indicate that considering such hypotheses in the practice of science
is natural and reasonable. Surprisingly, the two standard statistical theories for testing
hypotheses, classical (frequentist p-values) and orthodox Bayesian (Bayes factors), have
well known and documented problems for handling or interpreting sharp hypotheses.
These problems are thoroughly reviewed, from statistical, methodological, systemic and
epistemological perspectives. 

Chapter 1 and appendix A present the FBST, or Full Bayesian Significance Test, an 
unorthodox Bayesian significance test specifically designed for this task. The mathemati- 
cal and statistical properties of the FBST are carefully analyzed. In particular, it is shown 
how the FBST fully supports the test and identification of eigen-solutions in the practice 
of science, using procedures that take into account all the essential attributes pointed out by
von Foerster. In contrast to some alternative belief calculi or logical formalisms based on
discrete algebraic structures, the FBST is based on continuous statistical models. This
makes it easy to support concepts like sharp hypotheses, asymptotic convergence and
stability, and these are essential concepts in the representation of eigen-solutions. The
same chapter presents cognitive constructivism as a coherent epistemological framework
that is compatible with the FBST formalism, and vice-versa. I will refer to this setting
as the Cognitive Constructivism plus FBST formalism, or CogCon+FBST framework for
short. 

The discussion in Chapter 1 raised some interesting questions, some of which we will 
try to answer in the present chapter. The first question relates to the role and the 
importance of language in the emergence of eigen-solutions and is discussed in section 2. 
In answering it, we make extensive use of William Rasch's "two-front war" metaphor
of cognitive constructivism, as presented in Rasch (2000). As explained in section 4, this is
the war against dogmatic realism on one front, and against skepticism or solipsism on the
second. The results of the first part of the chapter are summarized in section 5. To illustrate
his arguments, Rasch uses some ideas of Niels Bohr concerning quantum mechanics. In
section 3, we use some of the same ideas to give concrete examples of the topics under
discussion. The importance (and also the mystery) related to the role of language in the
practice of science was one of the major concerns of Bohr's philosophical writings, see
Bohr (1987, I-IV), as exemplified by his famous "dirty dishes" metaphor:

"Washing dishes and language can in some respects be compared. We have 
dirty dishwater and dirty towels and nevertheless finally succeed in getting the 
plates and glasses clean. Likewise, we have unclear terms and a logic limited 
in an unknown way in its field of application - but nevertheless we succeed in 
using it to bring clearness to our understanding of nature." Bohr (2007). 

The second question, posed by Søren Brier, which asks whether the CogCon+FBST
framework is compatible with and can benefit from the concepts of Semiotics and Peircean 
philosophy, is addressed in section 6. In section 7 I present my final remarks. 

Before ending this section a few key definitions related to the concept of eigen-solution 
are reviewed. As stated in Maturana and Varela (1980, p. 10), the concept of recurrent
state is the key to understanding the concept of cognitive domain in an autopoietic system.

"Living systems as units of interaction specified by their conditions of being
living systems cannot enter into interactions that are not specified by their or- 
ganization. The circularity of their organization continuously brings them back 
to the same internal state (same with respect to the cyclic process). Each inter- 
nal state requires that certain conditions (interactions with the environment) 
be satisfied in order to proceed to the next state. Thus the circular organization 
implies the prediction that an interaction that took place once will take place 
again. If this does not happen the system maintains its integrity (identity with 
respect to the observer) and enters into a new prediction. In a continuously 
changing environment these predictions can only be successful if the
environment does not change in that which is predicted. Accordingly, the predictions
implied in the organization of the living system are not predictions of
particular events, but of classes of interactions. Every interaction is a particular
interaction, but every prediction is a prediction of a class of interactions that 
is defined by those features of its elements that will allow the living system 
to retain its circular organization after the interaction, and thus, to interact 
again. This makes living systems inferential systems, and their domain of 
interactions a cognitive domain." 



The epistemological importance of these circular (cyclic or recursive) regenerative
processes and their eigen (auto, equilibrium, fixed, homeostatic, invariant, recurrent,
recursive)-states, both in concrete and abstract autopoietic systems, is further investigated
in Foerster and Segal (2001, p. 145, 127-128):

"The meaning of recursion is to run through one's own path again. One of
its results is that under certain conditions there exist indeed solutions which,
when reentered into the formalism, produce again the same solution. These
are called "eigen-values", "eigen-functions", "eigen-behaviors", etc.,
depending on which domain this formation is applied - in the domain of numbers, in
functions, in behaviors, etc."
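
The recursion described in this quotation can be illustrated numerically in the domain of numbers. The following is a minimal sketch, not an example taken from the quoted text: the helper name `eigen_value` and the choice of the cosine as the operation are assumptions made only for illustration.

```python
import math


def eigen_value(operation, start, tol=1e-12, max_iter=10_000):
    """Re-enter the output of `operation` as its next input until it stops changing."""
    x = start
    for _ in range(max_iter):
        x_next = operation(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("no stable value reached within max_iter iterations")


# Iterating the cosine from two different starting points settles on the same value,
# a number x that reproduces itself when re-entered: cos(x) = x.
fixed_from_zero = eigen_value(math.cos, 0.0)
fixed_from_two = eigen_value(math.cos, 2.0)

if __name__ == "__main__":
    print(fixed_from_zero, fixed_from_two)
```

The stable value does not depend on where the iteration starts, which gives a concrete feeling for the stability attributed to eigen-solutions.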



"Objects are tokens for eigen-behaviors. Tokens stand for something else. 
In exchange for money (a token itself for gold held by one's government, but
unfortunately no longer redeemable), tokens are used to gain admittance to
the subway or to play pinball machines. In the cognitive realm, objects are
the token names we give to our eigen-behavior. When you speak about a ball,
you are talking about the experience arising from your recursive sensorimotor
behavior when interacting with that something you call a ball. The "ball" as
object becomes a token in our experience and language for that behavior which
you know how to do when you handle a ball. This is the constructivist's insight
into what takes place when we talk about our experience with objects."

Von Foerster also establishes several essential attributes of these eigen-solutions, as
quoted in the following paragraph from Foerster (2003c, p. 266). These essential attributes 
can be translated into very specific mathematical properties, that are of prime importance 
when investigating several aspects of the CogCon+FBST framework.

"Eigenvalues have been found ontologically to be discrete, stable, separable and
composable, while ontogenetically to arise as equilibria that determine
themselves through circular processes. Ontologically, Eigenvalues and objects, and
likewise, ontogenetically, stable behavior and the manifestation of a subject's
"grasp" of an object cannot be distinguished."

2.2 Eigen-solutions and Language 

Goudsmit (1998, sec. 2.3.3, Objects as warrants for eigenvalues), finds an apparent
disagreement between the form in which eigen-solutions emerge, according to von Foerster and
Maturana:

"Generally, von Foerster's concept of eigenvalue concerns the value of a
function after a repeated (iterative) application of a particular operation. ...

This may eventually result in a stable performance, which is an eigenvalue of
the observer's behavior. The emerging objects are warrants of the existence of
these eigenvalues.

. . . contrary to von Foerster, Maturana considers the consensuality of distinc- 
tions as necessary for the bringing forth of objects. It is through the attain- 
ment of consensual distinctions that individuals are able to create objects in 
language. " 

Confirmation for the position attributed by Goudsmit to von Foerster can be found 
in several of his articles. In Foerster (2003a, p. 3), for example, one finds:

"... I propose to continue the use of the term 'self-organizing system,' whilst
being aware of the fact that this term becomes meaningless, unless the system
is in close contact with an environment, which possesses available energy and
order, and with which our system is in a state of perpetual interaction, such
that it somehow manages to 'live' on the expenses of this environment. ...
... So both the self-organizing system plus the energy and order of the
environment have to be given some kind of pre-given objective reality for this view
points to function."

Confirmation for the position attributed by Goudsmit to Maturana can also be found
in several of his articles. In Maturana (1988, sec. 9.iv), for example, one finds:

"Objectivity. Objects arise in language as consensual coordinations of actions 
that in a domain of consensual distinctions are tokens for more basic coordina- 
tions of actions, which they obscure. Without language and outside language 
there are no objects because objects only arise as consensual coordinations of 
actions in the recursion of consensual coordinations of actions that languaging 
is. For living systems that do not operate in language there are no objects; or 
in other words, objects are not part of their cognitive domains. ... Objects are 
operational relations in languaging. " 

The standpoint of Maturana is further characterized in the following paragraphs from 
Brier (2005, p.374): 

"The process of human knowing, is the process in which we, through languag- 
ing, create the difference between the world and ourselves; between the self 
and the non-self, and thereby, to some extent, create the world by creating 
ourselves. But we do it by relating to a common reality which is in some 
way before we made the difference between 'the world' and 'ourselves' make 
a difference, and we do it on some kind of implicit belief in a basic kind of 
order 'beneath it all'. I do agree that it does not make sense to claim that the 
world exists completely independently of us. But on the other hand it does not 
make sense to claim that it is a pure product of our explanations or conscious 
imagination. " 

"...it is clear that we do not create the trees and the mountains through our 
experiencing or conversation alone. But Maturana is close to claim that this 
is what we do. " 

In order to understand the above comments, one must realize that Maturana's
viewpoints, or at least his rhetoric, changed greatly over time, ranging from the measured and
precise statements in Maturana and Varela (1980), to some extreme positions assumed
in Maturana (1991, p. 36-44), see the quotations below. Maturana must have had in mind the
celebrated quote by Albert Einstein at the beginning of this chapter.

"Einstein said, and many other scientists have agreed with him, that sci- 
entific theories are free creations of the human mind, and he marveled that 
through them one could understand the universe. The criterion of validation 
of scientific explanation as operations in the praxis of living of the observer, 
however, permit us to see how it is that the first reflection of Einstein is valid, 
and how it is that there is nothing marvelous in that it is so."

"Scientific explanations arise operationally as generative mechanisms accepted
by us as scientists through operations that do not entail or imply any suppo- 
sition about an independent reality, so that in fact there is no confrontation 
with one, nor is it necessary to have one even if we believe that we can have 
one. " 

"Quantification (or measurements) and predictions can be used in the genera- 
tion of a scientific explanation but do not constitute the source of its validity. 
The notions of falsifiability (Popper), verificability, or confirmation would ap- 
ply to the validation of scientific knowledge only if this were a cognitive domain 
that revealed, directly or indirectly, by denotation or connotation, a transcen- 
dental reality independent of what the observer does..." 

"Nature is an explanatory proposition of our experience with elements of our 
experience. Indeed, we human beings constitute nature with our explaining, 
and with our scientific explaining we constitute nature as the domain in which 
we exist as human beings (or languaging living systems)." 

Brier (2005, p. 375) further contrasts the standpoint of Maturana with that of von 
Foerster: 

"Von Foerster is more aware of the philosophical demand that to put up a new 
epistemological position one has to deal with the problem of solipsism and of
pure social constructivism."

"The Eigenfunctions do not just come out of the blue. In some, yet only dimly
viewed, way the existence of nature and its 'things' and our existence are
intertwined in such a way that makes it very difficult to talk about. Von Foerster
realizes that to accept the reality of the biological systems of the observer leads
into further acceptance about the structure of the environment."

While the position adopted by von Foerster appears to be more realistic or objective,
the one adopted by Maturana seems more idealistic or (inter) subjective. Can these two
different positions, which may seem so discrepant, be reconciled? Do we have to choose
between an idealistic or a realistic position, or can we rather have both? This is one of
the questions we address in the next sections. 

In Chapter 1 we used an example of a physical eigen-solution (physical invariant) to
illustrate the ideas under discussion, namely, the speed of light constant, c. Historically,
this example is tied to the birth of Special Relativity theory, and the debacle of classical 
physics. In this chapter we will illustrate them with another important historical exam- 
ple, namely, the Einstein-Podolsky-Rosen paradox. Historically, this example is tied to 
questions concerning the interpretation of quantum mechanics. This is one of the main 
topics of the next section.



2.3 The Languages of Science

At the end of the 19th century, classical physics was the serene sovereign of science. 
Its glory was consensual and uncontroversial. However, at the beginning of the 20th 
century, a few experimental results challenged the explanatory power of classical physics. 
The problems appeared on two major fronts that, from a historical perspective, can be
linked to the theories (at that time still nonexistent) of Special Relativity and quantum
mechanics. 

At that time, the general perception of the scientific community was that these few 
open problems could, should and would be accommodated in the framework of classical 
physics. Crafting sophisticated structural models such as those for the structure of ether 
(the medium in which light was supposed to propagate), and those for atomic structure, 
was typical of the effort to circumvent these open problems by artfully maneuvering 
classical physics. But physics and engineering laboratories persisted, building up a barrage
of new and challenging experimental results. 

The difficulties with the explanations offered by classical physics not only persisted, 
but also grew in number and strength. By 1940 the consensus was that classical physics
had been brutally defeated, and Relativity and quantum mechanics were acclaimed as 
the new sovereigns. Let us closely examine some facts concerning the development of 
quantum mechanics (QM). 

One of the first steps in the direction of a comprehensive QM theory was taken in 1924
by Louis de Broglie, who postulated the particle-wave duality principle, which states that
every moving particle has an associated pilot wave of wavelength λ = h/mv, where h is
Planck's constant and mv is the particle's momentum, i.e., the product of its mass and
velocity. In 1926 Erwin Schrödinger stated his wave equation, capable of explaining all
known quantum phenomena, and predicting several new ones that were later confirmed by
new experiments. Schrödinger's theory is known as Orthodox QM, see Tomonaga (1962)
and Pais (1988) for detailed historical accounts. Orthodox QM uses a mathematical 
formalism based on a complex wave equation, and shares much of the descriptive language 
of de Broglie's particle-wave duality principle. 
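
To give a feeling for the magnitudes involved in de Broglie's relation λ = h/mv, the short sketch below evaluates it for an electron. The function name `de_broglie_wavelength`, the chosen speed of 10^6 m/s, and the use of the non-relativistic momentum mv are assumptions made for illustration; the numerical constants are the standard SI and CODATA values.

```python
PLANCK_CONSTANT = 6.62607015e-34   # J*s, exact by definition in the 2019 SI
ELECTRON_MASS = 9.1093837015e-31   # kg, CODATA 2018 recommended value


def de_broglie_wavelength(mass_kg, speed_m_per_s):
    """Return the wavelength h / (m v) in metres, using non-relativistic momentum."""
    if mass_kg <= 0 or speed_m_per_s <= 0:
        raise ValueError("mass and speed must be positive")
    return PLANCK_CONSTANT / (mass_kg * speed_m_per_s)


# An electron moving at one million metres per second (an illustrative speed).
wavelength = de_broglie_wavelength(ELECTRON_MASS, 1.0e6)

if __name__ == "__main__":
    print(f"{wavelength:.4e} m")
```

The result is a fraction of a nanometre, comparable to atomic dimensions, whereas for everyday masses the same formula gives wavelengths far too small to observe.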

There is, however, something odd in the wave-particle descriptions of orthodox QM. 
When describing a model we speak of each side of a double faced wave-particle entity, as if 
each side existed by itself, and then inextricably fuse them together in the mathematical 
formalism. Quoting Cohen (1989, p. 87),

"Notice how our language shapes our imagination. To say that a particle is 
moving in a straight line really means that we can set up particle detectors 
along the straight line and observe the signals they send. These signals would 
be consistent with a model of the particle as a single chunk of mass moving 
(back and forth) in accordance with Newtonian particle physics. It is important
to emphasize that we are not claiming that we know what the particle is, but
only what we would observe if we set up those particle detectors. " 

From Schrödinger's equation we can derive Heisenberg's uncertainty principle, which
states that we cannot go around measuring everything we want until we pin down
every single detail about (the classical entities in our wave-particle model of) reality. One
instance of the Heisenberg uncertainty principle states that we cannot simultaneously
measure a particle's position and momentum beyond a certain accuracy. One way of
interpreting this instance of the Heisenberg uncertainty principle goes as follows: In classical
Newtonian physics our particles are "big enough" so that our measurement devices can
obtain the information we need about the particle without disturbing it. In QM, on the
other hand, the particles are so small that the measurement operation will always disturb
the particle. For example, the light we have to use in order to illuminate the scene, so
we can see where the particle is, has to be so strong, relative to the particle size, that it
"blows" the particle away, changing its velocity. The consequence is that we cannot
(neither in practice, nor even in principle) simultaneously measure with arbitrary precision,
both the particle's position and momentum. Hence, we have to learn how to tame our 
imagination and constrain our language. 
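
The position-momentum instance mentioned above is usually written as σ_x σ_p ≥ ħ/2, where σ_x and σ_p are the standard deviations of position and momentum and ħ is the reduced Planck constant. This inequality is not spelled out in the text; the sketch below adopts it, together with the hypothetical helper name `min_momentum_uncertainty`, only to show how the bound trades one precision against the other.

```python
REDUCED_PLANCK_CONSTANT = 1.054571817e-34  # J*s, h / (2*pi), rounded


def min_momentum_uncertainty(position_uncertainty_m):
    """Smallest sigma_p compatible with sigma_x * sigma_p >= hbar / 2."""
    if position_uncertainty_m <= 0:
        raise ValueError("position uncertainty must be positive")
    return REDUCED_PLANCK_CONSTANT / (2.0 * position_uncertainty_m)


# Localizing a particle to about one angstrom (1e-10 m), an illustrative scale.
sigma_p = min_momentum_uncertainty(1.0e-10)

if __name__ == "__main__":
    print(f"{sigma_p:.4e} kg*m/s")
```

Halving the position uncertainty doubles the minimum momentum uncertainty, so no choice of measurement makes both arbitrarily small.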

The need to exercise a strict discipline over what kinds of statements to use was a 
lesson learned by 20th century physics - a lesson that mathematics had to learn a bit
earlier. A classical example from set theory of a statement that cannot be allowed is
Russell's catalog (class, set), defined in Robert (1988, p.x) as:

"The 'catalogue of all catalogues not mentioning themselves. ' Should one in- 
clude this catalogue in itself? ... Both decisions lead to a contradiction!" 
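
The contradiction in this definition can be checked mechanically. In the minimal sketch below, the function name `russell_rule_holds` is a hypothetical choice; it simply tests the catalogue's defining rule against each of the two possible decisions about whether the catalogue lists itself.

```python
def russell_rule_holds(lists_itself: bool) -> bool:
    """Check the catalogue's defining rule, applied to the catalogue itself.

    The rule says: the catalogue lists X exactly when X does not list X.
    For X equal to the catalogue, this demands lists_itself == (not lists_itself).
    """
    return lists_itself == (not lists_itself)


# Try both decisions: include the catalogue in itself, or leave it out.
consistent_choices = [choice for choice in (True, False) if russell_rule_holds(choice)]

if __name__ == "__main__":
    print(consistent_choices)
```

The list of consistent choices is empty: neither decision satisfies the rule, which is why the definition has to be excluded as an invalid statement.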

Robert (1988) indicates several ways of avoiding this paradox (or antinomy). All of
them imply imposing a (very reasonable) set of rules on how to form valid statements.
Under any of these rules, Russell's definition becomes an invalid or ill posed statement and,
as such, should be disregarded, see Halmos (1998, ch.1 and 2) and Dugundji (1966, ch.1)
for introductory texts and Aczel (1988) for an alternative view. Measure theory (of Borel,
Lebesgue, Haar, etc.) was a fundamental achievement of 20th century mathematics. It
defines measures (notions such as mass, volume and probability) for parts of R^n. However
not all parts of R^n are included, and we must refrain from speaking about the measure of
inadmissible (non-measurable) sets, see Ulam (1943) for a short article, Kolmogorov and
Fomin (1982) for a standard text, and Nachbin (1965) and Bernardo (1993) for extensions
pertinent to the FBST formalism. The main subject in Robert (1988) is Non Standard
Analysis, a form of extending the languages of both Set Theory and Real Analysis, see the
observations in section 6.6 and also Davis (1977, sec. 3.4), Goldblatt (1998) and Nelson
(1987).

All the preceding examples of mathematical languages have one thing in common:
When crafting a specific language, one has to carefully define what kinds of statements 
are accepted as valid ones. Proper use of the language must be constrained to valid 
statements. Such constraints are necessary in order to preserve language coherence. 

The issue of what kinds of statements should be accepted as valid in QM is an
interesting and still open question, epitomized by the famous debate at the Brussels Solvay
conference of 1930 between Niels Bohr and his friend and opponent Albert Einstein.
Ruhla (1992, ch. 7 and 8) and Baggott (1992, under the topic hidden variables) give very
intuitive reviews of the subject, requiring minimal mathematical expertise. Without the
details concerning the physics involved, one can describe the debate as follows: While Bohr
suggested very strict rules for admissible statements in QM, Einstein advocated more
permissive ones. In 1935 Einstein, Podolsky and Rosen suggested a "gedankenexperiment",
known as the EPR paradox, as a compelling argument supporting Einstein's point of
view. D. Bohm, in 1952, and J. Bell, in 1964, contributed to the debate by showing that
the EPR paradox could lead to concrete experiments providing a way to settle the de- 
bate on empirical grounds. It was only in 1972 that the first EPR experiment could be 
performed in practice. The observational evidence from these experiments seems to favor 
Bohr's point of view! 

One of today's standard formalisms for QM is Abstract QM, see Hughes (1992) or
Chester (1987) for a readable text and Cohen (1989) for a concise and formal treatment.
For an alternative formalism based on Niels Bohr's concept of complementarity, see Bohr
(1987, I-IV) and Costa and Krause (2004). Other formalisms may also become useful,
see for example Kolmanovskii and Nosov (1986, sec. 2.3) and Zubov (1983). Abstract QM,
which is very clean and efficient, can be stratified in two layers. In the first layer, all basic 
calculations are carried out using an algebra of operators in (Rigged) Hilbert spaces. In a 
second layer, the results of these calculations are interpreted as probabilities of obtaining 
specific results in physical measurements, see also Rijsbergen (2004). One advantage of 
using the stratified structure of abstract QM is that it naturally avoids (most of) the 
danger of forming invalid statements in QM language. Cohen (1989, p.vii) provides the 
following historical summary: 

"Historically, ... quantum mechanics developed in three stages. First came 
a collection of ad hoc assumptions and then a cookbook of equations known 
as (orthodox) quantum mechanics. The equations and their philosophical 
underpinning were then collected into a model based on mathematics of Hilbert 
space. From the Hilbert space model came the abstraction of quantum logics." 

From the above historical comments we draw the following conclusions: 

3.1. Each of the QM formalisms discussed in this section, namely, de Broglie's 
wave-particle duality principle, Schrödinger's orthodox QM and Hilbert space 
abstract QM, operates like a language. Maturana stated that objects arise in 
language. He seems to be right. 

3.2. It also seems that new languages must be created (or discovered) to 
provide us with the objects corresponding to the structure of the environment, as 
stated by von Foerster. 

3.3. Exercising a strict discipline concerning what kinds of statements can be 
used in a given language and context seems to be vital in many areas. 

3.4. It is far from trivial to create, craft, discover, find and/or use a language 
so that "it works", providing us the "right" objects (eigen-solutions). 

3.5. Even when everything looks (for the entire community) fine and well, 
new empirical evidence can bring our theories down like a house of cards. 

As indicated by an anonymous referee, abstract formalisms or languages do not exist in 
a vacuum, but sit on top of (or are embedded in) natural (or less abstract) languages. This 
brings us to the interesting and highly relevant issues of hierarchical language structures 
and constructive ladders of objects, including interdependence analyses between objects 
at different levels of such complex structures, see Piaget (1975) for an early reference. For 
a recent concrete example of the scientific relevance of such interdependences in the field 
of Psychology, using a Factor Analysis statistical model, see Shedler and Westen (2004, 
2005). These issues are among the main topics addressed in chapter 3 and forthcoming 
articles. 

2.4 The Self-Reference Paradox 

The conclusions established in the previous section may look reasonable. In 3.4, however, 
what exactly are the "right" objects? Clearly, the "right" objects are "those" objects we 
more or less clearly see and can point at, using as reference language the language we 
currently use. 

There! I have just fallen, head-on, into the quicksands of the self-reference paradox. 
Don't worry (or do worry), but note this: The self-reference paradox is unavoidable, 
especially as long as we use English or any other natural human language. 

Rasch (2000, p. 73, 85) has produced a very good description of the self-reference 
paradox and some of its consequences: 

"having it both ways seems a necessary consequence... One cannot just have 
it dogmatically one way, nor skeptically the other... One oscillates, therefore, 
between the two positions, neither denying reality nor denying reality's 
essentially constructed nature. One calls this not idealism or realism, but (cognitive) 
constructivism." 

"What do we call this oscillation? We call it paradox. Self-reference and 
paradox - sort of like love and marriage, horse and carriage." 

Cognitive Constructivism implies a double rejection: That of a solipsist denial of 
reality, and that of any dogmatic knowledge of the same reality. Rasch uses the "two 
front war" metaphor to describe this double rejection. Carrying the metaphor a bit 
further, the enemies of cognitive constructivism could be portrayed, or caricatured, as: 

- Dogmatism despotically requires us to believe in its (latest) theory. Its 
statements and reasons should be passively accepted with fanatic resignation 
as infallible truth; 

- Solipsism's anarchic distrust wishes to preclude any established order in the 
world. Solipsism wishes to transform us into autistic skeptics, incapable of 
establishing any stable knowledge about the environment in which we live. 
We refer to Caygill (1995, dogmatism) for a historical perspective on the Kan- 
tian use of some of the above terms. 

Any military strategist will be aware of the danger in the oscillation described by 
Rasch, which alternately exposes a weak front. The enemy at our strong front will be 
subjugated, but the enemy at our weak front will hit us hard. Rasch sees a solution to 
this conundrum, while recognizing that it may be difficult to achieve, Rasch 
(2000, p. 85): 

"There is a third choice: to locate oneself directly on the invisible line that must 
be drawn for there to be a distinction mind / body (system / environment) in 
the first place. Yet when one attempts to land on that perfect center, one 
finds oneself oscillating wildly from side to side, perhaps preferring the mind 
(system) side, but overcompensating to the body (environment) side - or vice 
versa. 

The history of post-Kantian German idealism is a history of the failed search 
for this perfect middle, this origin or neutral ground outside both mind and 
body that would nevertheless actualize itself as a perfect transparent mind/body 
within history. Thus, much of contemporary philosophy that both follows 
and rejects that tradition has become fascinated by, even if trapped in, the 
mind/body oscillation. " 

So, the question is: How do we land on Rasch's fine (invisible) line, finding the perfect 
center and avoiding dangerous oscillations? This is the topic of the next section. 

2.5 Objective Idealism and Pragmatism 

We are now ready for a few definitions of basic epistemological terms. These definitions 
should help us build epistemic statements in a clear and coherent form according to the 
CogCon+FBST perspective. 

5.1. Known (knowable) Object: An actual (potential) eigen-solution of a 
given system's interaction with its environment. In the sequel, we may use a 
somewhat more friendly terminology by simply using the term Object. 

5.2. Objective (how, less, more): Degree of conformance of an object to 
the essential attributes of an eigen-solution. 

5.3. Reality: A (maximal) set of objects, as recognized by a given system, 
when interacting with single objects or with compositions of objects in that 
set. 

5.4. Idealism: Belief that a system's knowledge of an object is always 
dependent on the system's autopoietic relations. 

5.5. Realism: Belief that a system's knowledge of an object is always 
dependent on the environment's constraints. 

5.6. Solipsism, Skepticism: Idealism without Realism. 

5.7. Dogmatic Realism: Realism without Idealism. 

5.8. Realistic or Objective Idealism: Idealism and Realism. 

5.9. "Something in itself": This expression, used in reference to a specific 
object, is a marker or label for ill posed statements. 

CogCon+FBST assumes an objective and idealistic epistemology. Definition 5.9 
labels some ill posed dogmatic statements. Often, the description of the method used to 
access something in itself looks like: 

- Something that an observer would observe if the (same) observer did not exist, or 

- Something that an observer could observe if he made no observations, or 

- Something that an observer should observe in the environment without interacting 
with it (or disturbing it in any way), and many other equally nonsensical variations. 

Some of the readers may not like this form of labeling this kind of invalid statement, 
preferring to use, instead, a more elaborate terminology, such as "object in parenthesis" 
(approximately) as object, "object without parenthesis" (approximately) as something 
in itself, etc. There may be good reasons for doing so; for example, this elaborate 
language has the advantage of automatically stressing the differences between constructivist 
and dogmatic epistemologies, see Maturana (1988), Maturana and Poerksen (2004) and 
Steier (1991). Nevertheless, we have chosen our definitions in agreement with some very 
pragmatic advice given in Bopry (2002): 

"Objectivity as defined by a (dogmatic) realist epistemology may not exist 
within a constructivist epistemology; but, part of making that alternative epis- 
temology acceptable is gaining general acceptance of its terminology. As long 
as the common use of the terms is at odds with the concepts of an epistemolog- 
ical position, that position is at a disadvantage. Alternative forms of inquiry 
need to coopt terminology in a way that is consistent with its own epistemology. 
I suggest that this is not so difficult. The term objective can be taken back..." 

Among the definitions 5.1 to 5.9, definition 5.2 plays a key role. It allows us to say 
how well an eigen-solution manifests von Foerster's essential attributes, and consequently, 
how good (objective) our knowledge of it is. However, the degree of objectivity cannot 
be assessed in the abstract; it must be assessed by the means and methods of a given 
empirical science, namely the one within which the eigen-solution is presented. Hence, 
definition 5.2 relies on an "operational approach", and not on metaphysical arguments. 
Such an operational approach may be viewed with disdain by some philosophical schools. 
Nevertheless, for C.S.Peirce it is 

"The Kernel of Pragmatism", CP 5.464-465: 

"Suffice it to say once more that pragmatism is, in itself, no doctrine of meta- 
physics, no attempt to determine any truth of things. It is merely a method 
of ascertaining the meanings of hard words and of abstract concepts. ... All 
pragmatists will further agree that their method of ascertaining the meanings 
of words and concepts is no other than that experimental method by which all 
the successful sciences (in which number nobody in his senses would include 
metaphysics) have reached the degrees of certainty that are severally proper 
to them today; this experimental method being itself nothing but a particular 
application of an older logical rule, 'By their fruits ye shall know them'." 

Definition 5.2 also requires a belief calculus specifically designed to measure the 
statistical significance, that is, the degree of support given by empirical data to the existence 
of an eigen-solution. In Chapter 1 we showed why confirming the existence of an eigen-solution 
naturally corresponds to testing a sharp statistical hypothesis, and why the mathematical 
properties of FBST e-values correspond to the essential attributes of an eigen-solution as 
stated by von Foerster. In this sense, the FBST calculus is perfectly adequate to support 
the use of the term Objective and correlated terms in scientific language. Among the 
most important properties of the e-value mentioned in Chapter 1 and Appendix A, we 
find: 

Continuity: Give a measure of significance that is smooth, i.e. continuous and dif- 
ferentiable, on the hypothesis parameters and the sample statistics, under appropriate 
regularity conditions of the statistical model. 

Consistency: Provide a consistent, that is, asymptotically convergent significance 
measure for a given sharp hypothesis. 

Therefore, the FBST calculus is a formalism that allows us to assess, continuously and 
consistently, the objectivity of an eigen-solution, by means of a convergent significance 
measure, see Chapter 1. We should stress, once more, that achieving comparable goals 
using alternative formalisms based on discrete algebraic structures may be, in general, 
rather difficult. Hence, our answer to the question of how to land on Rasch's perfect 
center is: Replace unstable oscillation with stable convergence! 
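
The consistency property can be illustrated numerically. The following is a minimal sketch, not the implementation discussed in the appendices: it assumes a Bernoulli model with a uniform prior (hence a Beta posterior), a flat reference density, and the sharp hypothesis theta = 0.5. The function names, the grid size and the sample counts are all illustrative assumptions. The e-value is taken, following the definition in Pereira and Stern (1999), as the posterior mass of the points whose posterior density does not exceed the supremum of the density over the hypothesis.

```python
import math

def log_beta_pdf(theta, a, b):
    """Log density of a Beta(a, b) distribution at theta in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

def e_value(successes, failures, theta0=0.5, cells=20000):
    """Approximate e-value of the sharp hypothesis theta = theta0.

    Assumptions (illustrative only): uniform prior, flat reference density,
    midpoint-rule integration over `cells` equal cells of (0, 1).
    """
    a, b = successes + 1, failures + 1           # Beta posterior parameters
    log_s_star = log_beta_pdf(theta0, a, b)      # sup of the density over H
    h = 1.0 / cells
    total = 0.0      # numerical posterior mass of (0, 1)
    outside = 0.0    # mass outside the tangent set {density > s*}
    for i in range(cells):
        theta = (i + 0.5) * h
        log_p = log_beta_pdf(theta, a, b)
        mass = math.exp(log_p) * h
        total += mass
        if log_p <= log_s_star:
            outside += mass
    return outside / total

# Observed success rate 0.7: support for theta = 0.5 vanishes as n grows.
against = [e_value(7, 3), e_value(70, 30), e_value(700, 300)]
# Observed success rate 0.5: the hypothesis sits at the posterior mode.
in_favor = [e_value(5, 5), e_value(500, 500)]
```

In this sketch the e-values in `against` shrink toward zero as the sample grows, while those in `in_favor` stay at one, which is the kind of stable convergence referred to above.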

Any dispute about objectivity (epistemic quality or value of an object of knowledge) 
should be critically examined and evaluated within this pragmatic program. This program 
(in Luhmann's sense) includes the means and methods of the empirical science in which 
the object of knowledge is presented, and the FBST belief calculus, used to evaluate the 
empirical support of an object, given the available experimental data. 

Even if overoptimistic (actually hopelessly utopian), it is worth restating Leibniz' flag 
of Calculemus, as found in Gerhardt (1890, v. 7, p. 64-65): 

"Quo facto, quando orientur controversiae, non magis disputatione opus erit 
inter duos philosophos, quam inter duos Computistas. Sufficiet enim calamos 
in manus sumere sedereque ad abacos, et sibi mutuo (accito si placet amico) 
dicere: Calculemus. " 

A contemporary translation could read: Actually, if controversies were to arise, there 
would be no more need for dispute between two philosophers than between two 
statisticians. For them it would suffice to reach for their computers and, in friendly 
understanding, say to each other: Let us calculate! 

2.6 The Philosophy of C.S.Peirce 

In the previous sections we presented an epistemological perspective based on a pragmatic 
objective idealism. Objective idealism and pragmatism are also distinctive characteristics 
of the philosophy of C.S.Peirce. Hence the following question, posed by Søren Brier, that 
we examine in this section: Is the CogCon+FBST framework compatible with and can it 
benefit from the concepts of Semiotics and Peircean philosophy? 

In Chapter 1 we already explored the idea that eigen-solutions, as discrete entities, 
can be named, i.e., become signs in a language system, as pointed out by von Foerster in Segal 
(2001, p.128): 

"There is an additional point I want to make, an important point. Out of an 
infinite continuum of possibilities, recursive operations carve out a precise set 
of discrete solutions. Eigen-behavior generates discrete, identifiable entities. 
Producing discreteness out of infinite variety has incredibly important conse- 
quences. It permits us to begin naming things. Language is the possibility 
of carving out of an infinite number of possible experiences those experiences 
which allow stable interactions of yourself with yourself. " 

We believe that the process of recursively "discovering" objects of knowledge, 
identifying them by signs in language systems, and using these languages to "think" and structure 
our lives as self-conscious beings, is the key for understanding concepts such as 
signification and meaning. These ideas are explored, in a great variety of contexts, in Bakken and 
Hernes (2002), Brier (1995), Ceruti (1989), Efran et al. (1990), Eibl-Eibesfeldt (1970), 
Ibri (1992), Piaget (1975), Wenger et al. (1999), Winograd and Flores (1987) and many 
others. Conceivably, the key underlying common principle is stated in Brier (2005, p. 395): 

"The key to the understanding of understanding, consciousness, and com- 
munication is that both the animals and we humans live in a self-organized 
signification sphere which we not only project around us but also project deep 
inside our systems. Von Uexküll calls it "Innenwelt" (Brier 2001). The or- 
ganization of signs and the meaning they get through the habits of mind and 
body follow very much the principles of second order cybernetics in that they 
produce their own Eigenvalues of sign and meaning and thereby create their 
own internal mental organization. I call this realm of possible sign processes 
for the signification sphere. In humans these signs are organized into language 
through social self-conscious communication, and accordingly our universe is 
organized also as and through texts. But of course that is not an explanation 
of meaning. " 

When studying the organization of self-conscious beings and trying to understand 
semantic concepts such as signification and meaning, or teleological concepts such as fi- 
nality, intent and purpose, we move towards domains concerning systems of increasing 
complexity that are organized as higher hierarchical structures, like the domains of phe- 
nomenological, psychological or sociological sciences. In so doing, we leave the domains of 
natural and technical sciences behind, at least for a moment, see Brent and Bruck (2006) 
and Muggleton (2006), in last month's issue of Nature (March 2006, when this article was 
written), for two perspectives on future developments. 

As observed in Brier (2001), the perception of the objects of knowledge changes from 
more objective or realistic to more idealistic or (inter) subjective as we progress to higher 
hierarchical levels. Nevertheless, we believe that the fundamental nature of objects of 
knowledge as eigen-solutions, with all the essential attributes pointed out by von Foerster, 
remains just the same. Therefore, a sign, as understood in the CogCon+FBST framework, 
always stands for the following triad: 

S-1. Some perceived aspects, characteristics, etc., concerning the organization 
of the autopoietic system. 

S-2. Some perceived aspects, characteristics, etc., concerning the structure of 
the system's environment. 

S-3. Some object (discrete, separable, stable and composable eigen-solution 
based on the particular aspects stated in S-1 and S-2) concerning the interac- 
tion of the autopoietic system with its environment. 

This triadic character of signs brings us, once again, close to the semiotic theory of 
C.S.Peirce, offering many opportunities for further theoretical and applied research. For 
example, we are currently using statistical psychometric analyses in an applied semiotic 
project for the development of software user interfaces, for related examples see Ferreira 
(2006). We defer, however, the exploration of these opportunities to forthcoming articles. 

In the remainder of this section we focus on a more basic investigation that, we believe, 
is a necessary preliminary step that must be undertaken in order to acquire a clear 
conceptual horizon that will assist a sound and steady progress in our future research. The 
purpose of this investigation is to find out whether the CogCon+FBST framework can 
find a truly compatible ground in the basic concepts of Peircean philosophy. We proceed 
by establishing a conceptual mapping of the fundamental concepts used to define the 
CogCon+FBST epistemological framework into analogous concepts in Peircean philosophy. 
Before we start, however, a word of caution: The work of C.S.Peirce is extremely rich, 
and open to many alternative interpretations. Our goal is to establish the compatibil- 
ity of CogCon+FBST with one possible interpretation, and not to ascertain reductionist 
deductions, in any direction. 

The FBST is a Continuous Statistical formalism. Our first step in constructing this 
conceptual mapping addresses the following questions: Is such a formalism amenable to a 
Peircean perspective? If so, which concepts in Peircean philosophy can support the use 
of such a formalism? 

6.1 Probability and Statistics: The FBST is a probability theory based statistical 
formalism. Can the probabilistic concepts of the FBST find the necessary support in 
concepts of Peircean philosophy? We believe that Tychism is such a concept in Peircean 
philosophy, providing the first element in our conceptual mapping. In CP 6.201 Tychism 
is defined as: 

"... the doctrine that absolute chance is a factor of the universe." 

6.2 Continuity: As stated in the previous section, the CogCon+FBST program pursues 
the stable convergence of the epistemic e-values given by the FBST formalism. The 
fact that FBST is a belief calculus based on continuous mathematics is essential for 
its consistency and convergence properties. Again we have to ask: Does the continuity 
concept used in the FBST formalism have an analogous concept in Peircean philosophy? 
We believe that the analogy can be established with the concept of Synechism, thus 
providing the second element in our conceptual mapping. 

In CP 6.169 synechism is defined as: 

"that tendency of philosophical thought which insists upon the idea of continu- 
ity as of prime importance in philosophy and, in particular, upon the necessity 
of hypotheses involving true continuity. " 

6.3 Eigen-Solutions: A key epistemological concept in the CogCon+FBST perspective 
is the notion of eigen-solution. Although the system theoretic concept of Eigen-solution 
cannot possibly have an exact correspondent in Peirce's philosophy, we believe that Peirce's 
fundamental concept of "Habit" or "Insistency" offers an adequate analog. Habit, and 
reality, are defined as: 

"The existence of things consists in their regular behavior.", CP 1.411. 

"Reality is insistency. That is what we mean by 'reality'. It is the brute 
irrational insistency that forces us to acknowledge the reality of what we expe- 
rience, that gives us our conviction of any singular.", CP 6.340. 

However, the CogCon+FBST concept of eigen-solution is characterized by von Foer- 
ster by several essential attributes. Consequently, in order that the conceptual mapping 
under construction can be coherent, these characteristics have to be mapped accordingly. 
In the following paragraphs we show that the essential attributes of sharpness (discrete- 
ness), stability and compositionality can indeed be adequately represented. 

6.3a Sharpness: The first essential attribute of eigen-solutions stated by von Foerster 
is discreteness or sharpness. As stated in Chapter 1, it is important to realize that, in 
the sequel, the term 'discrete', used by von Foerster to qualify eigen-solutions in general, 
should be replaced, depending on the specific context, by terms such as lower-dimensional, 
precise, sharp, singular, etc. As physical laws or physical invariants, sharp hypotheses are 
formulated as mathematical equations. 

Can Peircean philosophy offer a good support for sharp hypotheses? Again we believe 
that the answer is in the affirmative. The following quotations should make that clear. 
The first three passages are taken from Ibri (1992, p. 84-85) and the next two from 
CP 1.487 and CP 1.415, see also NEM 4, p.136-137 and CP 6.203. 

"an object (a thing) IS only in comparison with a continuum of possibilities 
from which it was selected. " 

"Existence involves choice; the dice of infinite faces, from potential to actual, 
will have the concreteness of one of them." 

"...as a plane is a bi-dimensional singularity, relative to a tri-dimensional 
space, a line in a plane is a topic discontinuity, but each of these elements is 
continuous in its proper dimension." 

"Whatever is real is the law of something less real. Stuart Mill defined matter 
as a permanent possibility of sensation. What is a permanent possibility but a 
law?" 

"In fact, habits, from the mode of their formation, necessarily consist in the 
permanence of some relation, and therefore, on this theory, each law of nature 
would consist in some permanence, such as the permanence of mass, momen- 
tum, and energy. In this respect, the theory suits the facts admirably. " 

6.3b Stability: The second essential attribute of eigen-solutions stated by von Foerster 
is stability. As stated in Stern (2005), a stable eigen-solution of an operator, defined by 
a fixed-point or invariance equation, can be found (built or computed) as the limit of a 
sequence of recursive applications of the operator. Under appropriate conditions (such 
as within a domain of attraction, for instance) the process convergence and its limiting 
eigen-solution will not depend on the starting point. 
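
As a small illustration of this notion of stability, the sketch below iterates an operator until it reaches a fixed point. The choice of the cosine function as the operator, the helper name and the starting points are assumptions made only for this example; the cosine map is used because its iteration on the real line is known to converge to a single fixed point (near 0.739) from any starting value.

```python
import math

def iterate_to_fixed_point(operator, start, tol=1e-12, max_iter=10_000):
    """Apply `operator` recursively until successive values differ by < tol."""
    x = start
    for _ in range(max_iter):
        nxt = operator(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    raise RuntimeError("no convergence within max_iter iterations")

# Very different starting points, same operator (hypothetical example values).
starts = [-3.0, 0.1, 1.0, 25.0]
limits = [iterate_to_fixed_point(math.cos, s) for s in starts]

# Each limit x satisfies the invariance equation x = cos(x) up to tolerance,
# and all limits coincide: the eigen-solution does not depend on the start.
residuals = [abs(math.cos(x) - x) for x in limits]
spread = max(limits) - min(limits)
```

Here `spread` measures how far apart the limits reached from different starting points are, and `residuals` how well each limit satisfies the fixed-point equation.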

A similar notion of stability for an object-sign complex is given by Peirce. As stated 
in CP 1.339: 

"That for which it (a sign) stands is called its object; that which it conveys, 
its meaning; and the idea to which it gives rise, its interpretant. The object 
of representation can be nothing but a representation of which the first repre- 
sentation is the interpretant. But an endless series of representations, each 
representing the one behind it, may be conceived to have an absolute object at 
its limit." 

6.3c Compositionality: The third essential attribute of eigen-solutions stated by von 
Foerster is compositionality. As stated in Chapter 1 and Appendix A, compositionality 
properties concern the relationship between the credibility, or truth value, of a complex 
hypothesis, H, and those of its elementary constituents, H^j, j = 1 ... k. Compositionality 
is at the very heart of any theory of language, see Noeth (1995). As an example of 
compositionality, see CP 1.366 and CP 6.23, where Peirce discusses the composition of forces, 
that is, how the components are combined using the parallelogram law. 

"If two forces are combined according to the parallelogram of forces, their resul- 
tant is a real third... Thus, intelligibility, or reason objectified, is what makes 
Thirdness genuine.". 

"A physical law is absolute. What it requires is an exact relation. Thus, a 
physical force introduces into a motion a component motion to be combined 
with the rest by the parallelogram of forces;". 

In order to establish a minimal mapping, there are two more concepts in CogCon+FBST 
to which we must assign adequate analogs in Peircean philosophy. 

6.4 Extra variability: In Chapter 1 the importance of incorporating all sources of noise 
and fluctuation, i.e., all the extra variability statistically significant to the problem under 
study, into the statistical model is analyzed. The following excerpt from CP 1.175 
indicates that Peirce's notion of fallibilism may be used to express the need for allowing and 
embracing all relevant (and in practice inevitable) sources of extra variability. According 
to Peirce, fallibilism is "the doctrine that there is no absolute certainty in knowledge". 

"There is no difficulty in conceiving existence as a matter of degree. The 
reality of things consists in their persistent forcing themselves upon our recog- 
nition. If a thing has no such persistence, it is a mere dream. Reality, then, 
is persistence, is regularity. ... as things (are) more regular, more persistent, 
they (are) less dreamy and more real. Fallibilism will at least provide a big 
pigeon-hole for facts bearing on that theory. " 

6.5 - Bayesian statistics: FBST is an Unorthodox Bayesian statistical formalism. 
Peirce has a strong and unfavorable opinion about Laplace's theory of inverse probabilities. 

"...the majority of mathematical treatises on probability follow Laplace in re- 
sults to which a very unclear conception of probability led him. ... This is an 
error often appearing in the books under the head of 'inverse probabilities'." 
CP 2.785. 

Due to his theory of inverse probabilities, Laplace is considered one of the earliest 
precursors of modern Bayesian statistics. Is there a conflict between CogCon+FBST and 
Peirce's philosophy? We believe that a careful analysis of Peirce's arguments not only 
dissipates potential conflicts, but also reinforces some of the arguments used in Chapter 
1. 

Two main arguments are presented by Peirce against Laplace's inverse probabilities. 
In the following paragraphs we will identify these arguments and present an up-to-date 
analysis based on the FBST (unorthodox) Bayesian view: 

6.5a - Dogmatic priors vs. Symmetry and Maximum Entropy arguments: 

"Laplace maintains that it is possible to draw a necessary conclusion regarding 
the probability of a particular determination of an event based on not knowing 
anything at all [about it]; that is, based on nothing. ... Laplace holds that 
for every man there is one law (and necessarily but one) of dissection of each 
continuum of alternatives so that all the parts shall seem to that man to be 
'egalement possibles' in a quantitative sense, antecedently to all information.", 
CP 2.764. 

The dogmatic rhetoric used at the time of Laplace to justify ad hoc prior distribu- 
tions can easily backfire, as it apparently did for Peirce. Contemporary arguments for the 
choice of prior distributions are based on MaxEnt formalism or symmetry relations, see 
Dugdale (1996), Eaton (1989), Kapur (1989) and Nachbin (1965). Contemporary argu- 
ments also examine the initial choice of priors by sensitivity analysis, for finite samples, 
and give asymptotic dissipation theorems for large samples, see DeGroot (1970), Gelman 
et al. (2003) and Stern (2004). We can only hope that Peirce would be pleased with 
the contemporary state of the art. These powerful theories have rendered ad hoc priors 
unnecessary, and consigned early dogmatic arguments to oblivion. 
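
The asymptotic dissipation of the prior can be sketched with a toy computation. This is only an illustration under stated assumptions, not one of the theorems in the references above: it uses a Bernoulli model with conjugate Beta priors, and the two prior choices (a uniform Beta(1, 1) and a strongly opinionated Beta(20, 2)), the observed success rate and the function names are all hypothetical.

```python
def posterior_mean(prior_a, prior_b, successes, failures):
    """Posterior mean of a Bernoulli rate under a conjugate Beta(a, b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

def prior_gap(n, rate=0.7):
    """Distance between the posterior means obtained from two different priors
    after n observations with the given observed success rate."""
    successes = round(rate * n)
    failures = n - successes
    vague = posterior_mean(1, 1, successes, failures)          # uniform prior
    opinionated = posterior_mean(20, 2, successes, failures)   # strong prior
    return abs(vague - opinionated)

# Sensitivity to the prior for increasing sample sizes.
gaps = [prior_gap(n) for n in (10, 100, 1000, 10000)]
```

For the small sample the two priors lead to visibly different conclusions, which is what a finite-sample sensitivity analysis would reveal; as the sample grows, the entries of `gaps` shrink and the initial choice of prior matters less and less.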

6.5b- Assignment of probabilities to (sharp) hypotheses vs. FBST possibilistic 
support structures: 

"Laplace was of the opinion that the affirmative experiments impart a defi- 
nite probability to the theory; and that doctrine is taught in most books on 
probability to this day, although it leads to the most ridiculous results, and is 
inherently self-contradictory. It rests on a very confused notion of what prob- 
ability is. Probability applies to the question whether a specified kind of event 
will occur when certain predetermined conditions are fulfilled; and it is the ra- 
tio of the number of times in the long run in which that specified result would 
follow upon the fulfillment of those conditions to the total number of times in 
which those conditions were fulfilled in the course of experience.", CP 5.169. 

In the second part of the above excerpt Peirce expresses a classical (frequentist) under- 
standing of having probability in the sample space, and not in the parameter space, that 
is, he admits predictive probability statements but does not admit epistemic probability 
statements. The FBST is a Bayesian formalism that uses both predictive and epistemic 
probability statements, as explained in Chapter 1. However, when we examine the reason 
presented by Peirce for adopting this position, in the first part of the excerpt, we find a 
remarkable coincidence with the arguments presented in Stern (2003, 2004, 2006, 2007) 
against the orthodox Bayesian methodology for testing sharp hypotheses: The FBST does 
not attribute a probability to the theory (sharp hypothesis) being tested, as do orthodox 
Bayesian tests, but rather a degree of possibility. In Stern (2003, 2004, 2006, 2007) we 
analyze procedures that attribute a probability to a given theory, and come to the exact 
same conclusion as Peirce did, namely, that those procedures are absurd. 
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6.6 Measure Theory: Let us now return to the Peircean concept of Synechism, to 
discuss a technical point of contention between orthodox Bayesian statistics and the FBST 
unorthodox Bayesian approach. The FBST formahsm relies on some form of Measure 
theory, see comments in section 3. De Finetti, the founding father of the orthodox school 
of Bayesian statistics, feels very uncomfortable having to admit the existence of non- 
measurable sets when using measure theory in dealing with probabilities, in which valid 
statements are called events, see de Finetti (1975, 3.11, 4.18, 6.3 and appendix). Dubins
and Savage (1976, p. 8) present similar objections, using the colorful gambling metaphors 
that are so characteristic of orthodox (decision theoretic) Bayesian statistics. In order to 
escape the constraint of having non-measurable sets, de Finetti (1975, v. 2, p. 259) readily 
proposes a deal: to trade off other standard properties of a measure, like countable (σ)
additivity:

"Events are restricted to be merely a subclass (technically a σ-ring with some
further conditions) of the class of all subsets of the base space. In order to make
σ-additivity possible, but without any real reason that could justify saying to
one set 'you are an event', and to another 'you are not'." 

In order to proceed with our analysis, we have to search for the roots of de Finetti's 
argument, roots that, we believe, lie outside de Finetti's own theory, for they hinge on
the perceived structure of the continuum. Bell (1998, p. 2), states: 

"the generally accepted set-theoretical formulation of mathematics (is one) in 
which all mathematical entities, being synthesized from collections of individu- 
als, are ultimately of a discrete or punctate nature. This punctate character is 
possessed in particular by the set supporting the 'continuum' of real numbers
- the 'arithmetical continuum'." 

Among the alternatives to arithmetical punctiform perspectives of the continuum, 
there are more geometrical perspectives. Such geometrical perspectives allow us to use an 
arithmetical set as a coordinate (localization) system in the continuum, but the 'ultimate 
parts' of the continuum, called infinitesimals, are essentially nonpunctiform, i.e. non point 
like. Among the proponents of infinitesimal perspectives for the continuum one should 
mention G.W.Leibniz, I.Kant, C.S.Peirce, H.Poincare, L.E.J.Brouwer, H.Weyl, R.Thom, 
F.W.Lawvere, A.Robinson, E.Nelson, and many others. Excellent historical reviews are 
presented in Bell (1998 and 2005), a general view, and Robertson (2001), for the ideas of 
C.S.Peirce. In the infinitesimal perspective, see Bell (1998, p. 3), 

"any of its (the continuum) connected parts is also a continuum and, accord- 
ingly, divisible. A point, on the other hand, is by its nature not divisible, and 
so (as stated by Leibniz) cannot be part of the continuum." 

In Peirce's doctrine of synechism, the infinitesimal geometrical structure of the continuum
acts like "the 'glue' causing points on a continuous line to lose their individual
identity", see Bell (1998, p. 208, 211). According to Peirce, "The very word continuity
implies that the instants of time or the points of a line are everywhere welded together."

De Finetti's argument on non-measurable sets implicitly assumes that all point subsets
of $R^n$ have equal standing, i.e., that the continuum has no structure. Under the arithmetical
punctiform perspective of the continuum, de Finetti's objection makes perfect sense,
and we should abstain from measure theory or alternative formalisms, as does orthodox 
Bayesian statistics. This is how Peirce's concept of synechism helps us to overcome a 
major obstacle (for the FBST) presented by orthodox Bayesian philosophy, namely, the 
objections against the use of measure theory. 

At this point it should be clear that my answer to Brier's question is emphatically 
affirmative. From Brier's comments and suggestions it is also clear how well he knew the 
answer when he asked me the question. As a maieutic teacher however, he let me look 
for the answers my own way. I can only thank him for the invitation that brought me for 
the first time into contact with the beautiful world of semiotics and Peircean philosophy. 

2.7 Final Remarks 

The physician Rambam, Moshe ben Maimon (1135-1204) of (the then caliphate of) Cor- 
doba, wrote Shmona Perakim, a book on psychology (medical procedures for healing the 
human soul) based on fundamental principles expounded by Aristotle in Nicomachean Ethics,
see Olitzky (2000) and Rackham (1926). Rambam explains how the health of the human 
soul depends on always finding the straight path (derech y'shara) or golden way (shvil 
ha-zahav), at the perfect center between the two opposite extremes of excess (odef) and 
scarcity (choser), see Maimonides (2001, v.1: Knowledge, ch.2: Temperaments, sec. 1,2):

"The straight path is the middle one, that is equidistant from both extremes.... 
Neither should a man be a clown or jokester, nor sad or mourning, but he
should be happy all his days in serenity and pleasantness. And so with all the 
other qualities a man possesses. This is the way of the scholars. Every man 
whose virtues reflect the middle, is called a chacham... a wise man." 

Rambam explains that an (always imperfect) human soul, at a given time and situation,
may be more prone to fall victim to one extreme than to its opposite, and should try to
protect itself accordingly. One way of achieving this protection is to offset its position in 
order to (slightly over-) compensate for an existing or anticipated bias. 

At the dawn of the 20th century, humanity had in classical physics a paradigm of 
science handing out unquestionable truth, and faced the brutality of many totalitarian 
states. Dogmatism had the upper hand, and we had to protect ourselves accordingly.

At the beginning of the 21st century we are enjoying the comforts of a hyperactive
economy that seems to be blind to the constraints imposed by our ecological environment, 
and our children are being threatened by autistic alienation through the virtual reality of 
their video games. It may be the turn of (an apathetic form of) solipsism. 

Finally, Rambam warns us about a common mistake: Protective offsets may be a 
useful precautionary tactic, or even a good therapeutic strategy, but should never be 
considered as a virtue per se. The virtuous path is the straight path, neither left of it nor 
right of it, but at the perfect center. 

Chapter 3

Decoupling, Randomization,
Sparsity, and Objective Inference

"The light dove, that at her free flight cleaves the air, 
therefore feeling its resistance, could perhaps imagine 
that she would succeed even better in the empty space."

Immanuel Kant (1724-1804), 
Critique of Pure Reason (1787, B-8). 

Step by step the ladder is ascended. 

George Herbert (1593 - 1633), 
Jacula Prudentium (1651). 



3.1 Introduction 

H.von Foerster characterizes "known" objects as eigen-solutions for an autopoietic system, 
that is, as discrete (sharp), separable (decoupled), stable and composable states of the 
interaction of the system with its environment. Previous chapters have presented the Full
Bayesian Significance Test (FBST) as a mathematical formalism specifically designed to
assess the support for sharp statistical hypotheses, and have shown that these hypotheses
correspond, from a constructivist perspective, to systemic eigen-solutions in the practice 
of science, as seen in chapter 1. In this chapter, the role and importance of one of these 
four essential attributes indicated by von Foerster, namely, separation or decoupling, is 
studied. 

Decoupling is the general principle that allows us to understand the world step by
step, 'looking' at it a piece at a time, localizing single features, isolating basic components 
or identifying simple objects, out of the immense complexity of the whole universe. In 
statistical models, decoupling is often introduced by means of no association assumptions, 
such as independence, zero covariance, etc. In this context, decoupling relations are 
sharp statistical hypotheses that can be tested, see for example Stern and Zacks (2002). 
Decoupling relations in statistical models can also be introduced a priori by means of 
special Design of Statistical Experiments (DSEs) techniques, the best known of which
is randomization.

In chapter 2 the general meaning of the term "Objective" (how, less, more) is defined 
as the "degree of conformance of an object to the essential attributes of an eigen-solution".
One of the common uses of the word objective, as opposed to "subjective", stresses the 
decoupling or separation of a given systemic eigen-solution, such as an object of a scientific 
program, from the peculiarities of a second system, such as a specific human observer. It 
is this restricted meaning, focusing on the decoupling property of systemic eigen-solutions, 
that justifies the use of the term objective in this chapter's title. 

The decoupling principle, and one of its most celebrated examples in Physics, the 
vibrating chord, are presented in section 2. In the vibrating chord model, a basic lin- 
ear algebra operation, the eigen-value factorization, is the key to obtain the decoupling 
operator. In addition, the importance of eigen-solutions and decoupling operations is
discussed from a constructivist epistemological perspective. Herein, we shall focus on
decoupling operators related to another basic linear algebra operation, namely, the Cholesky
factorization. In section 3 we show how Cholesky factorization can be used to decouple 
covariance structure models. In section 4, Simpson's paradox and some strategies for 
DSEs, such as control and randomization, are discussed. These strategies can be used to 
induce independence relations that are expressed in the sparsity structure of the model,
which can, in turn, be used for efficient decoupling. In section 5, the role of C.S.Peirce
in the introduction of control and randomization in DSEs is reviewed from a historical
perspective. This review will help us set the stage for the discussion, in section 6,
of a controversial issue: randomization in Bayesian Statistics. In section 7 some
epistemological consequences of randomization are discussed, and the underlying themata of
constructivism and objective knowledge are revisited.

The Cholesky factorization operator is presented in section 3, in conjunction with 
the computational concepts of sparse and structured matrices. Covariance structure and 
Bayesian networks are some of the most basic and widely used statistical models. Therefore,
understanding their decoupling properties is important, not only from a computational
point of view, but also from theoretical and epistemological perspectives.
Furthermore, one could argue that the usefulness of these statistical models is due
exactly to their decoupling properties. Final remarks are presented in section 8.

3.2 The Decoupling Principle

Understanding the entire universe, with all its intricate constituents, relations and
interconnections, can be a daunting task, as stated by Schlick (1979, v.1, p. 292):

"The most important (of these) difficulties arises from the recognition of the
unending linkage of all natural processes one with another. Its effect is that, 
on an exact view, every occurrence in the world is dependent on every other; 
the fall of a leaf is ultimately influenced by the motions of the stars, and it 
would be a task utterly beyond fulfillment to assign its 'cause' with absolute
completeness to any given process that we suppose determined down to the last 
detail. For this purpose we should have to adduce nothing less than all of the 
circumstances of the universe that have so far occurred. 

Now fortunately this boundlessness is at once considerably restricted by expe- 
rience, which teaches us that the reciprocal interdependence of all events in 
nature is subject to certain easy formulable conditions."

L.Sadun has written an exceptionally clear book on linear algebra, emphasizing the 
idea of decoupling, i.e. the strategy of breaking down complicated multivariate systems 
into simple 'modes', by a suitable change of coordinates, see also Rijsbergen (2004). Sadun 
(2001, p.1) states the goal of his book as follows:

"In this book we cover a variety of linear evolution equations, beginning with 
the simplest equations in one variable, moving to coupled equations in several 
variables, and culminating in problems such as wave propagation that involve 
an infinite number of degrees of freedom. Along the way we develop techniques, 
such as Fourier analysis, that allow us to decouple the equations into a set of 
scalar equations that we already know how to solve. 

The general strategy is always the same. When faced with coupled equations
involving variables $x_1, \ldots, x_n$, we define new variables $y_1, \ldots, y_n$. These
variables can always be chosen so that the evolution of $y_1$ depends only on $y_1$ (and
not on $y_2, \ldots, y_n$), the evolution of $y_2$ depends only on $y_2$, and so on. To find
$x_1(t), \ldots, x_n(t)$ in terms of the initial conditions $x_1(0), \ldots, x_n(0)$, we convert
$x(0)$ to $y(0)$, then solve for $y(t)$, then convert to $x(t)$."

As an example of paramount theoretical and historical importance in Physics, we
consider the discrete chord. The chord is kept at tension $h$, with $n$ particles of mass
$m$ at equally spaced positions $js$, $j = 1 \ldots n$. The extremes of the chord, at positions
$0$ and $(n+1)s$, are kept fixed, and $x = [x_1, x_2, \ldots, x_n]'$ denotes the particles' vertical
displacements, see French (1974, ch.5 Coupled oscillators and normal modes, p. 119-160),
Marion (1999, ch.9) and Franklin (1968, ch.7). Figure 1 shows the discrete chord for $n = 2$.

Figure 1: Eigen-Solutions of Continuous and Discrete Chords.

The second order differential equation of classical mechanics, below, provides a linear
approximation for the discrete chord system's dynamics:

$$ \ddot{x} + K x = 0 $$

As it is, the discrete chord differential equation is difficult to solve, since the n coordinates
of vector x are coupled by matrix K. In the following paragraphs we show how
to decouple this differential equation.

Suppose that an orthogonal matrix $Q$ is known to diagonalize matrix $K$, that is,
$Q^{-1} = Q'$, and $Q'KQ = D = \mathrm{diag}(d)$, $d = [d_1, d_2, \ldots, d_n]'$. After substituting
$x = Qy$ and pre-multiplying the above differential equation by $Q'$, we obtain the matrix equation

$$ Q'(Q\ddot{y}) + Q'K(Qy) = I\ddot{y} + Dy = 0 $$

which is equivalent to the n decoupled scalar equations for harmonic oscillators,
$\ddot{y}_k + d_k y_k = 0$, in the new 'normal' coordinates, $y = Q'x$. The solution of each harmonic
oscillator, as a function of time, $t$, has the form $y_k(t) = \sin(\varphi_k + w_k t)$, with phase
$0 \le \varphi_k \le 2\pi$ and angular frequency $w_k = \sqrt{d_k}$.

The columns of matrix $Q$, the decoupling operator, are the eigenvectors of matrix
$K$, which are, as one can easily check, multiples of the un-normalized vectors $z^k$. Their
corresponding eigenvalues, $d_k = w_k^2$, for $j,k = 1 \ldots n$, are given by
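
As a hedged sketch of these expressions, assume that K is the usual tridiagonal matrix of the loaded string, $K = w_0^2\,\mathrm{tridiag}(-1, 2, -1)$ with $w_0^2 = h/(ms)$. This form of K and the symbol $w_0$ are assumptions taken from the standard textbook treatment, not from the text above. Under that assumption the eigenpairs read:

```latex
z^k_j = \sin\!\left(\frac{jk\pi}{n+1}\right) , \qquad
d_k = w_k^2 = 2\,w_0^2 \left(1 - \cos\frac{k\pi}{n+1}\right)
            = 4\,w_0^2 \sin^2\!\left(\frac{k\pi}{2(n+1)}\right) ,
\qquad j,k = 1 \ldots n .
```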



The decoupled modes of oscillation, for $n = 2$, are depicted in Figure 1. They are
called 'normal' modes in physics, 'standing' modes in engineering, and eigen-solutions in 
mathematics. The discrete chord with n particles will have n normal modes, and the 
limit case, $n \to \infty$, is called the continuous chord. The normal modes of the continuous
chord are given by trigonometric functions, the first few of which are depicted in Figure 
1. They are also called 'standing' waves or eigen-functions of the chord, and constitute 
the basis of Fourier analysis. 
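
The decoupling described above can be checked numerically. The following is a minimal sketch, assuming the standard tridiagonal stiffness matrix K = w0^2 tridiag(-1, 2, -1) for the discrete chord; the function name `chord_matrix`, the choice n = 5 and the unit value of w0^2 are illustrative assumptions, not taken from the text.

```python
import numpy as np

def chord_matrix(n, w0_sq=1.0):
    """Assumed stiffness matrix of the discrete chord: w0^2 * tridiag(-1, 2, -1)."""
    K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    return w0_sq * K

n = 5
K = chord_matrix(n)

# Eigen-value factorization K = Q D Q'; eigh returns eigenvalues in ascending order.
d, Q = np.linalg.eigh(K)
D = Q.T @ K @ Q                      # diagonal in the normal coordinates y = Q'x

# Textbook closed form for this K: z^k_j = sin(j k pi / (n+1)),
# d_k = 2 w0^2 (1 - cos(k pi / (n+1))), also ascending in k.
k = np.arange(1, n + 1)
j = np.arange(1, n + 1)
d_closed = 2.0 * (1.0 - np.cos(k * np.pi / (n + 1)))
Z = np.sin(np.outer(j, k) * np.pi / (n + 1))   # column k holds the vector z^k

# Angular frequencies of the n decoupled harmonic oscillators.
w = np.sqrt(d)
print(np.round(w, 4))
```

In the normal coordinates the matrix D is diagonal, so each coordinate y_k evolves as an independent harmonic oscillator with frequency w_k.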

In either the discrete or the continuous chord, we can 'excite', i.e. give energy or 'put 
in motion', one of the normal modes, without affecting any other normal mode. This 
is the physical meaning of decoupling, i.e. to have 'separate' eigen-solutions. Since the 
differential equation describing the system is linear, distinct normal modes can also be su- 
perposed. This is called the 'superposition' principle, which renders the compositionality 
rule for the eigen-solutions of the chord. 

In the original coordinate system, x, coupling made it hard to follow the system's 
evolution. In the normal coordinate system, y, based on the system's eigen-solutions, 
decoupling and superposition made it easier to understand the system behavior. But are 
these eigen-solutions "just" a formal basis for an alternative coordinate system, or do they 
represent "real objects" within the system under study? 

Obviously, this is not a mathematical or physical question, but rather an epistemo- 
logical one. From a constructivist perspective, we can consider these eigen-solutions 
"objectively known" entities in the system. Nevertheless, the meaning of the term ob- 
jective in a constructivist epistemology is distinct from its meaning in a dogmatic realist 
epistemology, as explained in Stern (2006b, 2007a,b). 

From a constructivist perspective, systemic eigen-solutions can be identified and "named" 
by an observer. Indeed, the eigen-solutions of the vibrating chord were identified
and named thousands of years before mankind knew anything about differential equations.
The eigen-values of the chord are known in music as the 'fundamental tone' and
its 'higher harmonics', and constitute the basis for all known musical systems, see Benade
(1992).

The linear model for the vibrating chord is a paradigmatic example of the fact that,
despite being simple to understand and manipulate, linear models often give excellent
approximations for complex systems. Also, since linear operators are represented by matrices
in standard matrix algebra, the importance of certain matrix operations in the
decoupling of such models should not be surprising at all. In the vibrating chord model,
the eigen-value factorization, $K = QDQ'$, was the key to obtain the decoupling operator,
$Q$. The eigen-value factorization plays the same role in many important statistical
procedures, such as spectral analysis of time series, wavelet signal analysis, and kernel 
methods. 

Related operations of linear algebra, like Singular Value Decomposition, SVD, and 
Nonnegative Matrix Factorizations, NNMF, are important in principal components analysis
and latent structure models, see for example Bertsekas and Tsitsiklis (1989), Censor
and Zenios (1998), Cichocki et al. (2006), Dhillon and Sra (2005) and Hoyer
(2004). Distinct decoupling operators have distinct characteristics, relying upon stronger 
or weaker structural properties of the model, requiring more or less computational work, 
and having different capabilities for handling sparse data. 

In this chapter, we will be mainly interested in the decoupling of statistical models. 
More precisely, we shall focus on decoupling methods related to an important basic linear 
algebra operation, namely, the Cholesky factorization. In the next section we show how 
Cholesky factorization can be used to decouple covariance structure statistical models. 

The decoupling principle emerges, sometimes with different denominations, in virtually 
every area of the hard sciences. In Systems Theory and Mathematical Programming, for 
example, it arises under the name of Decomposition Methods. In the optimization of 
large systems there are two basic approaches to decomposition:

- High level methods focus on the underlying structure of the optimization problems. 
High level decomposition strategies replace the original large or complex problem by sev- 
eral hierarchically interconnected small or simple optimization problems, see for example 
Geoffrion (1972), Lasdon (1970) and Wismer (1971). 

- Low level methods look at the matrix representation of the optimization problems. 
Low level decomposition strategies benefit from tailor made computational linear algebra 
subroutines to take advantage of the underlying sparse matrix structure. Some of these 
techniques are discussed in the next section. 

3.3 Covariance Structure Models 

Covariance structure, multivariate regression, Kalman filter and several other related 
linear statistical models are widely used in the practice of science. They provide a powerful 
analytical tool in which the association, coupling or dependence between multiple variables 
is represented by covariance matrices, as briefly noted in the next paragraphs. These
models are simple to manipulate and interpret, and can be implemented using efficient
computational algorithms capable of handling millions of (sparsely coupled) variables. In
this and the next sections, it is shown how such desirable characteristics of covariance
models ultimately rely upon some basic properties of their decoupling operators.

Given a (vector) random variable, $x$, its covariance matrix, $V$, is defined as the expected
square distance to its expected (mean) value, $\beta$, that is,

$$ \beta = E(x) , \quad V = \mathrm{Cov}(x) = E((x - \beta) \otimes (x - \beta)') . $$

The diagonal elements, or variances, $\mathrm{Var}(x_i) = V_{i,i}$, give the most usual scalar measure
of error, dispersion or uncertainty used in statistics, while the off diagonal elements,
$\mathrm{Cov}(x_i, x_j) = V_{i,j}$, give a measure of association between two scalar random variables, $x_i$
and $x_j$, see Hocking (1985) for a general reference.

Also recall that since the expectation operator, $E$, is linear, that is, $E(Ax + b) =
A E(x) + b$ for any random vector $x$, matrix $A$ and vector $b$, we have

$$ \mathrm{Cov}(Ax + b) = A \, \mathrm{Cov}(x) \, A' . $$

The standard deviation, $\sigma_i = \sqrt{V_{i,i}}$, is a dispersion measure given in the same unit
as $x_i$, and the correlation, $C_{i,j} = V_{i,j} / (\sigma_i \sigma_j)$, is a measure of association normalized in the
$[-1, 1]$ interval.

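As is usual in the covariance structure literature, we can write a covariance matrix
as $V(\gamma) = \sum_t \gamma_t G^t$, in which the matrices $G^t$ constitute a basis for the space of symmetric
matrices of dimension $n \times n$, see Lauretto et al. (2002). For example, for dimension $n = 4$,

$$ V(\gamma) = \sum_{t=1}^{10} \gamma_t G^t =
\begin{bmatrix}
\gamma_1 & \gamma_5 & \gamma_7 & \gamma_8 \\
\gamma_5 & \gamma_2 & \gamma_9 & \gamma_{10} \\
\gamma_7 & \gamma_9 & \gamma_3 & \gamma_6 \\
\gamma_8 & \gamma_{10} & \gamma_6 & \gamma_4
\end{bmatrix} $$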

Using the above notation, we can easily express hypotheses concerning structural prop- 
erties, including sparsity patterns, in the standard form of vector functional equations, 
$h(\beta, \gamma) = 0$. Details on how to use the FBST to test such general hypotheses in some
particular settings can be found in Lauretto et al. (2002). 
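
As an illustration of this notation, the sketch below builds the basis matrices for n = 4 and writes one sparsity hypothesis in the form h(gamma) = 0. The index layout in `POSITIONS`, the helper names and the numerical values of gamma are assumptions made for the example, not definitions taken from Lauretto et al. (2002).

```python
import numpy as np

# Assumed layout for n = 4: parameters 1..4 sit on the diagonal, parameters
# 5..10 on the off-diagonal positions listed below (0-based matrix indices).
POSITIONS = [(0, 0), (1, 1), (2, 2), (3, 3),
             (0, 1), (2, 3), (0, 2), (0, 3), (1, 2), (1, 3)]

def basis_matrices(n=4, positions=POSITIONS):
    """Return symmetric 0/1 matrices G^t, one per covariance parameter."""
    basis = []
    for i, j in positions:
        g = np.zeros((n, n))
        g[i, j] = g[j, i] = 1.0
        basis.append(g)
    return basis

def covariance(gamma, basis):
    """V(gamma) = sum_t gamma_t G^t."""
    return sum(g_t * G_t for g_t, G_t in zip(gamma, basis))

def h_block_sparsity(gamma):
    """Sharp hypothesis h(gamma) = 0: the 2 x 2 off-diagonal block vanishes."""
    return np.asarray(gamma, dtype=float)[6:10]

G = basis_matrices()
gamma = [4.0, 3.0, 5.0, 2.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]   # hypothetical values
V = covariance(gamma, G)
print(V)
print(h_block_sparsity(gamma))
```

With this layout, setting the last four parameters to zero is exactly the statement that the first two variables are uncorrelated with the last two.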

Once we have established the structural properties of the model, we can estimate 
the parameters $\beta$ and $\gamma$ accordingly. Following the general line of investigation adopted
herein, a question that arises naturally is: How can we decouple the estimated model? 

One possible answer to this question can be given in terms of the Cholesky factorization,
$LL' = V$, where $L$ is lower triangular. Such a factorization is available for any symmetric
positive definite matrix $V$, and hence for any full rank covariance matrix, as shown in
Golub and van Loan (1989). Let $V = LL'$ be the
Cholesky factorization of the covariance matrix, $V$, and let us consider the transformation
of variables $y = L^{-1}x$, or $x = Ly$. The covariance matrix of the new variables can be
computed as $\mathrm{Cov}(y) = L^{-1} V L^{-t} = L^{-1} L L' L^{-t} = I$, where $L^{-t}$ denotes the
inverse of $L'$. Hence, the transformed model has
been decoupled, i.e., has uncorrelated random components.
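
The decoupling transformation can be tried out numerically. This is a minimal sketch using NumPy; the 3 x 3 covariance matrix, the sample size and the random seed are arbitrary choices made for illustration.

```python
import numpy as np

# Hypothetical full rank (positive definite) covariance matrix.
V = np.array([[4.0, 2.0, 0.6],
              [2.0, 3.0, 0.5],
              [0.6, 0.5, 2.0]])

L = np.linalg.cholesky(V)            # lower triangular factor, L @ L.T == V
L_inv = np.linalg.inv(L)

# Exact covariance of y = L^{-1} x, using Cov(Ax) = A Cov(x) A'.
cov_y = L_inv @ V @ L_inv.T

# Monte Carlo check: sample x with covariance V, transform, estimate Cov(y).
rng = np.random.default_rng(0)
x = rng.multivariate_normal(np.zeros(3), V, size=100_000)
y = np.linalg.solve(L, x.T).T        # applies L^{-1} to every sample
sample_cov_y = np.cov(y, rowvar=False)

print(np.round(cov_y, 12))
print(np.round(sample_cov_y, 2))
```

The exact product is the identity matrix, and the sample covariance of the transformed data agrees with it up to sampling error.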

Let us consider a simple numerical example of Cholesky factorization: 
V = LL'



This example of Cholesky factorization has some peculiarities: The matrix V is sparse, 
i.e., it has several zero elements. In contrast, a matrix with few or no zero elements is 
said to be dense. Matrix V in the example is also structured, i.e., the zeros are arranged in
a nice pattern, in this example, a 2 x 2 off diagonal block. In this example, the Cholesky
factor, L, with LL' = V, preserves the sparsity and structure of V, that is, no position with a
zero in V is filled with a non-zero in L. A factorization (or elimination) resulting in no fill
in is called perfect. Perfect eliminations are not always possible; however, there are several
techniques that can be used to obtain sparse (and structured) Cholesky factorizations in
which the fill in is minimized, that is, the sparsity of the Cholesky factor is maximized.
Pertinent references on sparse factorizations include Blair and Peyton (1993), Bunch
and Rose (1976), George et al. (1978, 1981, 1989, 1993), Golumbic (1980), Pissanetzky
(1984), Rose (1972), Rose and Willoughby (1972), Stern (1992,1994), Stern and Vavasis 
(1993,1994) and van der Vorst and van Dooren (1990). 
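
The notions of perfect elimination and fill in can be illustrated with small dense arrays. The sketch below is only illustrative: the matrices are hypothetical, the helper `fill_in` is not a library routine, and reordering by reversing the variables is one simple choice, not one of the ordering heuristics from the references above.

```python
import numpy as np

def lower_nonzeros(M, tol=1e-12):
    """Number of non-zero entries in the lower triangle of M (diagonal included)."""
    return int(np.count_nonzero(np.abs(np.tril(M)) > tol))

def fill_in(M):
    """Non-zeros created by the Cholesky factor at positions where M has zeros."""
    L = np.linalg.cholesky(M)
    return lower_nonzeros(L) - lower_nonzeros(M)

# Hypothetical structured matrix with a 2 x 2 zero off-diagonal block.
V = np.array([[4.0, 2.0, 0.0, 0.0],
              [2.0, 5.0, 0.0, 0.0],
              [0.0, 0.0, 9.0, 3.0],
              [0.0, 0.0, 3.0, 5.0]])

# Hypothetical 'arrow' matrix: the first variable is coupled to all the others.
A = np.array([[4.0, 1.0, 1.0, 1.0],
              [1.0, 2.0, 0.0, 0.0],
              [1.0, 0.0, 2.0, 0.0],
              [1.0, 0.0, 0.0, 2.0]])

# Same model with the variables listed in reverse order (dense variable last).
perm = [3, 2, 1, 0]
A_reordered = A[np.ix_(perm, perm)]

print(fill_in(V), fill_in(A), fill_in(A_reordered))
```

The block matrix factors with no fill in, while the arrow matrix fills its whole lower triangle unless the densely coupled variable is eliminated last. This is why the ordering of variables matters for sparse factorizations.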

Large models may have millions of sparsely coupled variables. A sparse and structured 
factorization of such a model gives a 'simple' decoupling operator, L. This is a matter 
of vital importance when designing efficient computational procedures. In practice, large 
models can only be computed with the help of these techniques. Another important class
of statistical models, Bayesian Networks, relies on sparse factorization techniques that,
from an abstract graph theoretical perspective, are almost identical to sparse Cholesky 
factorization, see for example Lauritzen (2006) and Stern (2006a, sec. 9-11). 

In the next section we continue to examine the role of covariance, or more general 
forms of association, in statistical modeling. In particular, we examine some situations
leading to spurious associations, destroying a model's presumed sparsity and structure. 
In the following sections we review, from an historical and epistemological perspective, 
some techniques of Design of Statistical Experiments (DSE), used to induce (no) associa- 
tion relations in statistical models. These relations translate into sparsity and structural 
patterns that, in turn, can be used by efficient factorization algorithms. 

3.4 Simpson's Paradox and the Control of Confounding Variables

Lindley (1991, p.47-48) illustrates Simpson's paradox with a medical trial example. From
80 patients in the study, 40 received treatment, T, and 40 received a placebo with no 
effect, NT. Some patients recovered from their illness, R, and some did not, NR. The 
recovery rates, R%, are given in Table 1, where the experimental data is shown, both in 
aggregate form for All patients, and separated or disaggregated according to Sex. Looking 
at the table one concludes that the treatment is bad for either male or female patients, but 
good for all of them together! This is Simpson's paradox: The association between
two variables, T and R in Lindley's example, is reversed if the data is aggregated or
disaggregated over a confounding variable, Sex in Lindley's example.



Table 1: Simpson's Paradox.

Sex   Group   R    NR   Total   R%
Fem   NT      9    21   30      30%



Lindley provides the following scenario for the situation illustrated by this example: 
The physician responsible for the experiment did not trust the treatment and also was 
aware that the illness under study affects females most severely. Hence, he decided to 
try it mainly on males, who would probably recover anyway. This illustrates the general 
Simpson's paradox situation, generated by the association of the confounding variable with 
both the explained and one (or more) of the explaining variables. Additional references 
on several aspects related to Simpson's paradox include Blyth (1972), Cobb (1998),
Good and Mittal (1987), Gotzsche (2002), Greenland et al. (1999, 2001), Heydtmann 
(2002), Hinkelmann (1984), Pearl (2004) and Reintjes et al. (2000). 
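
The reversal can be reproduced with a few lines of arithmetic. The counts below are an assumption: they are the figures usually quoted for this example, chosen to agree with the 80 patients, the 40/40 split between T and NT, and the 30% recovery rate of female patients on placebo mentioned here. The dictionary layout and function names are illustrative.

```python
# Assumed counts: (sex, group) -> (recovered, not recovered).
counts = {
    ("Male", "T"):  (18, 12),
    ("Male", "NT"): (7, 3),
    ("Fem", "T"):   (2, 8),
    ("Fem", "NT"):  (9, 21),
}

def recovery_rate(recovered, not_recovered):
    return recovered / (recovered + not_recovered)

def aggregated(group):
    """Pool both sexes for one treatment group."""
    r = sum(c[0] for (sex, g), c in counts.items() if g == group)
    nr = sum(c[1] for (sex, g), c in counts.items() if g == group)
    return r, nr

rates = {key: recovery_rate(*c) for key, c in counts.items()}
rates[("All", "T")] = recovery_rate(*aggregated("T"))
rates[("All", "NT")] = recovery_rate(*aggregated("NT"))

for key in sorted(rates):
    print(key, f"{100 * rates[key]:.0f}%")

# Treatment was given mainly to males: 30 of the 40 treated patients.
males_treated = sum(counts[("Male", "T")])
print(males_treated)
```

Within each sex the treated group recovers less often than the placebo group, yet the pooled treated group recovers more often, because the treated group is dominated by males, who recover more often in either arm.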

The obvious question then is: How can we design a statistical experiment in order to 
avoid spurious associations? 

Two strategies are self-evident: 

1. Control possible confounding variables in order to impose some form of invariance 
(constancy, equality) in the experiment, or 

2. Measure possible confounding variables so that the relevant ones can be included in 
the statistical model. 

The simplest form of the first strategy would be to test the treatment in a set of
'clones', individuals that are, using the words of Fisher (1966, sec. 9, Randomization; the
Physical Basis of Validity of the Test, p. 17-19),

"exactly alike, in every respect except that to be tested".

This strategy, however, is too strict. Even if feasible, the conclusions of the study would 
only apply to the 'clone population', not to individuals from a population with natural 
variability. 

A more general form of the first strategy is known as blocking, defined in Box et al.
(1978, p. 102-103, Sec.4.3, Blocking and Randomization) as:

"The device of pairing observations is a special case of 'blocking' that has
important applications in many kinds of experiments. A block is a portion of
the experimental material (the two shoes of one boy in this example) that is
expected to be more homogeneous than the aggregate (all shoes of all the boys).
By confining treatment comparisons within such blocks, greater precision can
often be obtained."

Blocking is a very important strategy in the design of statistical experiments (DSEs), 
used to increase, whenever possible, the precision of the study's conclusions. 

As for the second strategy, it looks like a sure thing! No statistician would ever refuse
more information, in a larger and richer data bank. 

Nevertheless, we have to ask whether we want to control and/or measure SOME of 
the possibly confounding variables, i.e. those perceived as the most important or even 
those we are aware of, or ALL of them? 

Keeping everything under control in a statistical experiment (or in life in general) 
constitutes, in the words of Fisher, 

"a totally impossible requirement in our example, and equally in all other forms 
of experimentation". 

Not only would the cost and complexity of trying to do so for a very large set of variables
be prohibitive in any practical circumstance, but also

"it would be impossible to present an exhaustive list of such possible differences 
(variables) appropriate for any one kind of experiment, because the uncon- 
trolled causes which may influence the result are always strictly innumerable". 

Modern theory of DSEs offers a way out of this conundrum that, in its most concise
form, see Box et al. (1978, p. 102-103), can be stated as: 

- Control what you can, and randomize what you can not. 

Randomization, as defined by Hacking (1988, p. 428), is 

"(the) notion of random assignment of treatment to a subset of the plots or 
persons, leaving the rest as controls. ... I shall speak of an experiment using 
randomization in this way as involving a randomized design. ... 
... There is a related but distinguishable idea of (random) representative sam- 
pling. " 

As is usual in the statistical literature, Hacking distinguishes between two intended
uses of randomization, namely random design and random sampling. Random design 
aims to eliminate bias coming from systematic design problems, including several forms 
of uncontrolled influence, either conscious or unconscious, received from and exerted by 
agents participating in the experiment. Random sampling, on the other hand, is intended 
to justify, somehow, assumptions concerning the functional form of a distribution in the 
statistical model of the experiment. The distinction between random design and random 
sampling will be kept here, even though, as briefly mentioned in section 6, a deeper 
probabilistic analysis of randomization shows that, from a theoretical point of view, the 
two concepts can greatly overlap. 

Our immediate interest in randomization (and control) is in whether it can assist the 
design of experiments by inducing independence relations. This strategy is pinpointed 
in the following quote from Pearl (2000, p. 340, 348, Epilogue: The Art and Science of 
Cause and Effect): 

"...Fisher's 'randomized experiment'... consists of two parts, 'randomization' 
and 'intervention'." 

"Intervention means that we change the natural behavior of the individual: 
we separate subjects into two groups, called treatment and control, and we 
convince the subjects to obey the experimental policy. We assign treatment 
to some patients who, under normal circumstances, will not seek treatment, 
and give placebo to patients who otherwise would receive treatment. That, 
in our new vocabulary, means 'surgery' - we are severing one functional link 
and replacing it with another. Fisher's great insight was that connecting the 
new link to a random coin flip 'guarantees' that the link we wish to break is 
actually broken. The reason is that a random coin is assumed to be unaffected 
by anything we can measure on macroscopic level..." 
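
The decoupling described in this quote can be illustrated with a minimal sketch. It is not part of the original text: the function name `randomized_assignment`, its parameters, and the use of a seeded pseudo-random shuffle in place of Fisher's coin flip are all assumptions made for illustration, and the subject identifiers are assumed to be unique and hashable.

```python
import random

def randomized_assignment(subject_ids, seed=None):
    """Split subjects into treatment and control groups by a random shuffle.

    The shuffle plays the role of the coin flip: the group a subject lands in
    depends only on the pseudo-random generator, not on any attribute of the
    subject (for example, whether the subject would normally seek treatment).
    """
    subjects = list(subject_ids)
    rng = random.Random(seed)      # seeded only to make the sketch reproducible
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2      # first half of the shuffled list is treated
    treated = set(shuffled[:half])
    return {s: ("treatment" if s in treated else "control") for s in subjects}

# Hypothetical usage: ten subjects labelled 0..9.
groups = randomized_assignment(range(10), seed=42)
```

In an actual experiment a physical device, such as the coin in the quote, would replace the pseudo-random generator; the seed is used here only so that the example can be repeated.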

3.5 C.S.Peirce and Randomization 

We believe that many fine points about the role of randomization in the DSEs can be 
better understood by following its development from an historical perspective. This is 
the topic of this section. 

In the period of 1850 to 1880 the quantitative analysis of human sensation in response 
to physical (tactile, acoustic or visual) stimuli, was the main goal of 'psychophysics'. A 
typical hypothesis in this research program was Fechner's law, see Hernstein and Boring 
(1966, p.72), which stated that, 

"The magnitude of sensation ($\gamma$) is not proportional to the absolute value of 
the stimulus ($\beta$), but rather to the logarithm of the magnitude of the stimulus 
when this is expressed in terms of its threshold value ($b$), i.e. that magnitude 
considered as unit at which the sensation begins and disappears." 

In modern mathematical notation, $\gamma = k \log(\beta / b)$, for $\beta > b$. 

In his psychophysical experiments Fechner tested his own ability to distinguish the 
stronger in a pair of stimuli. For example, he would prepare two objects of masses $\mu$ 
and $\mu + \delta$, and later on he would lift them, and 'answer' which one appeared to him to 
be the heavier. A quantitative analysis would later relate the proportion of right and 
wrong answers with the values of $\mu$ and $\delta$, see Stigler (1986, ch.7, Psychophysics as a 
Counterpoint, p. 239-261). Fechner was well aware of the potential difficulties resulting 
from the fact that the experiments were not performed blindly, that is, since he prepared 
the experiment himself, he could know in advance the right answer. Nevertheless, he 
claimed to be able to control himself, be objective, and overcome this difficulty. 

According to Dehue (1997), in the 1870s, G.E.Müller and several researchers 
at Tübingen and Göttingen Universities began to improve the design of psychophysical 
experiments. The first major improvement was blinding: the stimuli were prepared or 
administered by an 'Experimenter' or 'Operator' and applied to a distinct person, the 
'Observer', 'Patient' or 'Subject', who was kept unaware of the actual intensity of the 
stimuli. 

The second major improvement was the precaution of presenting the stimuli in 'ir- 
regular order' (buntem Wechsel). This irregularity was introduced to prevent the patient 
from becoming habituated to patterns in the sequence of stimuli presented to him or, in 
other words, to keep him from forming expectations and guessing the right answers. 
Nevertheless, there was, at that time, neither a general theory defining 'irregularity', nor 
a systematic method for providing an 'irregular order'. 

In 1885, Charles Sanders Peirce and his student Joseph Jastrow presented randomization 
as a practical solution, in this context, to the question of irregularity, that is, 
systematic randomization should prevent any effective guessing by the patient, see 
Hacking (1988, III. Psychophysics: Peirce's at Work, p. 431-434). Peirce was in fact insisting 
on 'exchangeability', a key notion in the analysis of randomization in modern statistics 
and, most especially, in Bayesian statistics, that will be discussed in the next section. 

Peirce also struggled with the dilemma of allowing or not, in the course of the ex- 
periment, sequences that do not 'appear' to be random. His conclusions, see Peirce and 
Jastrow (1884, p. 122), are, once more, precursors to De Finetti's concept of exchangeabil- 
ity: 

"The pack (of playing-cards) was well shuffled, and, the operator and sub- 
ject having taken their places, the operator was governed by the color of the 
successive cards ... 

A slight disadvantage in this mode of proceeding arises from the long runs of 
one particular kind of change, which would occasionally be produced by chance 
and would tend to confuse the mind of the subject. But it seems clear that 
this disadvantage was less than that which would have been occasioned by his 
knowing that there would be no such long runs if any means had been taken to 
prevent them. " 

Regardless of its importance, Peirce's solution of randomization was not accepted by 
his contemporaries, fell into oblivion, and was almost forgotten, until it reappeared much 
later in the work of R.A.Fisher. We believe that there are several entangled reasons to 
explain such a twisted historical process. The psychophysics community raised objections 
against some of the hypotheses, and also against some methodological aspects presented 
in Peirce's paper. Besides, there is also a confounding factor generated by a second 
role played by randomization in Peirce's paper, namely, 'randomization to measure faint 
effects'. We shall briefly discuss these aspects in the next paragraphs. 

Fechner assumed the existence of a threshold (Schwelle), b, below which small differences 
could no longer be discerned. Peirce wanted to refute the existence of this threshold, 
assuming instead a continuously decreasing sensitivity to smaller and smaller differences. 
We should remark that for Peirce this should not have been a fortuitous hypothesis, since 
it can be related to his general philosophical ideas, most especially to the concept of 
synechism, see chapter 2, Hartshorne et al. (1992) and Eisele (1976). 

Peirce postulated that the patients' sensitivity could be adequately measured by the 
probability of correct answers, even when the difference was too faint to be consciously 
discerned by the same patients. Hence, in experiments similar to Fechner's, Peirce asked 
the patient always to guess the correct answer. Peirce also asked the patient to give 
the answer a confidence score from 0 to 3. Peirce analyzed his experimental data and 
derived empirical formulae relating the (rounded) 'subjective' confidence scores, m, and 
the 'objective' probability of correct answers, p, as in Peirce and Jastrow (1884, p. 122): 

"The average marks seem to conform to the formula $m = c \log(p / (1 - p))$, 
where m denotes the degree of confidence on the scale, p denotes the probability 
of the answer being right, and c is a constant which may be called the index of 
confidence. " 
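
The formula quoted above, m = c log(p / (1 - p)), is what is now called a log-odds (logit) relation, and it can be inverted to recover p from m. The following is a minimal sketch, not taken from the source: the function names are hypothetical, and the natural logarithm is an assumption (a different base only rescales the constant c).

```python
import math

def confidence_mark(p, c=1.0):
    """Peirce and Jastrow's empirical formula m = c * log(p / (1 - p)).

    p is the probability of a right answer (strictly between 0 and 1),
    c is the 'index of confidence'; the natural logarithm is assumed.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return c * math.log(p / (1.0 - p))

def probability_correct(m, c=1.0):
    """Inverse relation: p = 1 / (1 + exp(-m / c)), the logistic function."""
    return 1.0 / (1.0 + math.exp(-m / c))
```

Under this reading, a mark of m = 0 corresponds to p = 1/2, that is, to pure guessing, and higher marks correspond to higher probabilities of a right answer.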

At the time of Peirce's experiments, the psychophysical community gave great impor- 
tance to the analysis of the patient's subjective 'introspections'. According to this view, 
Peirce's experiments were criticized for asking the patient to guess the correct answer 
even when he expressed low confidence. Of course, if one understands Peirce's research 
program, it is clear that the experimental design he used is perfectly coherent. 
Unfortunately, this was not the judgment of his contemporaries. 

The same techniques and experimental designs used by Peirce were subsequently used 
by several researchers in attempts to measure faint effects, including effects produced by 
'below the consciousness threshold', sub-conscious, or sub-liminal stimuli. Some of these 
studies were really misconceived, and that may have been yet another contributing factor 
for the reactions against the use of randomization. Whatever the explanation might be, 
Peirce's paper fell into oblivion, and the progress of DSEs was delayed by half a century. 

3.6 Bayesian Analysis of Randomization 

The work of Ronald Aylmer Fisher can undoubtedly be held responsible for disseminating 
the modern approach to DSEs, including randomization, to almost any area of empirical 
research, see for example Fisher (1926, 1935). The idea of randomization, however, was 
later contested by some members of the Bayesian school. Commenting on the use of 
randomization after Fisher, Hacking (1988, p. 429-430), states: 

"Undoubtedly Fisher won the day, at least for the following generation, but 
then a new, although not completely unrelated, challenge to randomized design 
arose. This came from the revival of the 'Bayesian' school, typically associ- 
ated with L.J. Savage's theory of what he called personal probability. Here the 
object is to form an initial assessment of one's personal beliefs about a subject 
and to modify them in the light of experience and a theoretical analysis for- 
mally modeled by the calculus of probability and a theory of personal utility. 
It is widely held to be an almost immediate consequence of this approach that 
randomization is of no value at all (except perhaps to eliminate some kind of 
fraud). " 

This erroneous notion of incompatibility between the use of randomization and Bayesian 
statistics is now completely outdated. One of the most prestigious textbooks in contemporary 
Bayesian statistics, see Gelman et al. (2003, ch.7, p. 198), states: 

"A naive student of Bayesian inference might claim that because all inference 
is conditional on the observed data, it makes no difference how those data were 
collected. This misplaced appeal to the likelihood principle would assert that 
given (1) a fixed model (including the prior distribution) for the underlying 
data and (2) fixed observed values of the data, Bayesian inference is deter- 
mined regardless of the design for the collection of the data. Under this view 
there would be no formal role for randomization in either sample surveys or 
experiments. The essential flaw in the argument is that a complete definition 
of 'the observed data' should include information on how the observed values 
arose, and in many situations such information has a direct bearing on how 
these values should be interpreted. Formally then, the data analyst needs to 
incorporate the information describing the data collection process in the prob- 
ability model used for analysis. " 

Indeed, the classical argument using the likelihood principle against randomization in 
the DSEs, assumes a fixed, given statistical model and, as concisely stated by Kempthorne 
(1977, p.16): 

"The assertion that one does not need randomization in the context of the 
assumed (linear) model (above) is an empty one because an intrinsic role of 
randomization is to 'insure' against model inadequacies." 

Gelman et al. (2003, ch.7, p. 223-225) proceed to offer a much deeper analysis of the 
role of randomization from a Bayesian perspective, see also Rubin (1978). The key concept 
of "ignorable design" specifies decoupling conditions between the sampling (or censoring) 
process, described by an indicator variable, $I$, and the distribution of the observed 
variables, $y_{obs}$. If the experiment has an ignorable design, we can build a statistical model that 
explicitly considers $y_{obs}$ alone. Finally, it is ironic that perhaps one of the best arguments 
for incorporating randomization in Bayesian experimental design is a consequence of de 
Finetti's theorem for exchangeability. As mentioned in section 4, this argument also blurs 
the distinction between the concepts of randomized design and randomized sampling. We 
quote, once again, from Gelman et al. (2003, ch.7, p. 223-225): 

"How does randomization fit into this picture? First, consider the situation 
with no fully observed covariates x, in which case the 'only' way to have an 
invariant to permutation design - is to randomize. " 

"In this sense, there is a benefit to using different patterns of treatment as- 
signment for different experiments; if nothing else about the experiments is 
specified, they are exchangeable, and the global treatment assignment is neces- 
sarily randomized over the set of experiments. " 

3.7 Randomization, Epistemic Considerations 

Several researchers currently concerned with epistemological questions in Bayesian statistics 
are engaged in a reductionist program dedicated to translating every statistical test or 
inference problem into a decision theoretic procedure. One of the main proponents and 
early contributors to this program, but one who also had a much broader perspective, 
clearly articulating his epistemological insights and motivations, was Bruno de Finetti. 

In statistical models our knowledge of the world is encoded in probability distributions. 
Hence, it is vital to clarify the epistemological or ontological status of probability. Let 
us examine de Finetti's position, based on his own words, beginning with Finetti (1972, 
p.189) and Finetti (1980, p.212): 

"Any assertion concerning probabilities of events is merely the expression of 
somebody's opinion and not itself an event. There is no meaning, therefore, 
in asking whether such an assertion is true or false, or more or less probable." 

"Each individual making a 'coherent' evaluation of probability (in the sense I 
shall define later) and desiring it to be 'objectively exact', does not hurt anyone: 
everyone will agree that this is his subjective evaluation and his 'objectivist' 
statement will be a harmless boast in the eyes of the subjectivist, while it will 
be judged as true or false by the objectivist who agree with it or who, on the 
other hand, had a different one. This is a general fact, which is obvious but 
insignificant: 'Each in his own way. ' " 

Solipsism, from the Latin solus (alone) + ipse (self), can be defined as the epistemological 
thesis that the individual's subjective states of mind are the only proper or possible 
basis of knowledge. Metaphysical solipsism goes even further, stating that nothing really 
'exists' outside of one's own mind. From the two above quotations, it is clear that de 
Finetti stands, if not from a metaphysical, at least from an epistemological perspective, as 
a true solipsist. This goes farther than many theorists of the Bayesian subjectivist school 
would venture, but de Finetti charges ahead, with a program that is not only anti-realist, 
but also anti-idealist. In (1974, V1, Sec. 1.11, p. 21, 22, The Tyranny of Language), de 
Finetti launches a full-fledged attack against the vain and futile desire for any objective 
knowledge: 

"Much more serious is the reluctance to abandon the inveterate tendency of 
the savages to objectivize and mythologize everything (1); a tendency that, 
unfortunately, has been, and is, favored by many more philosophers than have 
struggled to free us from it (2). 

(1) The main responsibility for the objectivizationistic fetters inflicted on thought 
by everyday language rests with the verb 'to be' or 'to exist', and this is why we 
drew attention to it in the exemplifying sentences. From it derives the swarm 
of pseudoproblems from 'to be or not to be', to 'cogito ergo sum', from the 
existence of 'cosmic ether' to that of 'philosophical dogmas'. 

(2) This is what distinguishes acute minds, who enlivened thought and stimulated 
its progress, from narrow-minded spirits who mortified and tried to mummify 
it ... 'great thinkers' (like Socrates and Hume) and 'school philosophers' 
(like Plato and Kant)." 

De Finetti was also aware of the dangers of 'objective contamination', that is, any 
'objective' (probabilistic) statement can potentially 'infect' and spread its objectivity to 
other statements, see De Finetti (1974, V2, Sec. 7.5.7, p.41-42, Explanations based on 
'homogeneity'): 

"There is no way, however, in which the individual can avoid the burden of 
his own evaluations. The key can not be found that will unlock the enchanted 
garden wherein, among the fairy-rings and the shrubs of magic wands, beneath 
the trees laden with monads and noumena, blossom forth the flowers of 'Probabilitas 
realis'. With the fabulous blooms safely in our button-holes we would 
be spared the necessity of forming opinions, and the heavy loads we bear upon 
our necks would be rendered superfluous once and for all. " 

As we have seen in the last sections, a randomization device is built so as to provide 
legitimate 'objective' probabilistic statements about some events, and randomization procedures 
in DSEs are conceived exactly in order to spread this objectivity around. 

I. J. Good was another leading figure of the early days of the Bayesian revival movement. 
Unlike de Finetti, Good has always been aware of the dangers of an extreme 
subjectivist position, see for example Good (1983, Ch.8 Random Thoughts about Randomness, 
p. 93): 

" Some of you might have expected me, as a confirmed Bayesian, to restrict the 
meaning of the word 'probability' to subjective (personal) probability. That I 
have not done so is because I tend to believe that physical probability exists and 
is in any case a useful concept. I think physical probability can be measured only 
with the help of subjective probability, whereas de Finetti believes that it can be 
'defined' in terms of subjective probability. De Finetti showed that if a person 
has a consistent set of subjective or logical probabilities, then he will behave 
'as if' there were physical probabilities, where the physical probability has an 
initial subjective probability distribution. It seems to me that, if we are going 
to act as if the physical probability exists, then we don't lose anything practical 
if we assume it really 'does' exist. In fact I am not sure that existence means 
more than there are no conceivable circumstances in which the assumption of 
existence would be misleading. But this is perhaps too glib a definition. The 
philosophical impact of de Finetti's theorem is that it supports the view that 
solipsism cannot be logically disproved. Perhaps it is the mathematical theorem 
with most potential philosophical impact. " 

In our terminology we would have used the expression 'objective probability' instead of 
Good's expression, 'physical probability'. In 1962 Good edited a collection of speculative 
essays, including some on the foundations of statistics. The following short essay by 
Christopher S.O'D. Scott offers an almost direct answer to de Finetti, see Good (1962, 
sec. 114, p.364-365): 

"Scientific Inference: You are given a large number of identical inscrutable 
boxes. You are to select one, the 'target box', by any means you wish which 
does not involve opening any boxes, and you then have to say something about 
what is in it. You may do this by any means you wish which does not involve opening 
the target box. 

This apparent miracle can easily be performed. You only have to select the 
target box at random, and then open a random sample of other boxes. The 
contents of the sample boxes enable you to make an estimate of the contents 
of the target box which will be better than a chance guess. To take an extreme 
case, if none of the sample boxes contains a rabbit and your sample is large, 
you can state with considerable confidence: 'The target box does not contain a 
rabbit.' In saying this, you make no assumption whatever about the principles 
which may have been used in filling the boxes. 

This process epitomizes scientific induction at its simplest, which is the basis 
of all scientific inference. It depends only on the existence of a method of 
randomization, that is, on the assumption that events can be found which are 
unrelated (or almost) to given events. 

It is usually thought that scientific inference depends upon nature being orderly. 
The above shows that a seemingly weaker condition will suffice: Scientific in- 
ference depends upon our knowing ways in which nature is disorderly. " 
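
One way to put numbers on Scott's argument is sketched below. This is an illustration under stated assumptions, not something given in the source: we assume N boxes of which an unknown number K contain a rabbit, a target box and a sample of n other boxes drawn uniformly at random without replacement, and the decision rule "if no sampled box contains a rabbit, state that the target box contains none". The function names are hypothetical. The probability of seeing an empty sample and yet being wrong about the target can then be bounded over every possible K, that is, without any assumption on how the boxes were filled.

```python
from math import comb

def prob_wrong(N, K, n):
    """P(no rabbit in the n sampled boxes AND a rabbit in the target box).

    N boxes in total, K of them hold a rabbit. The target is drawn uniformly
    at random, then n further boxes are drawn at random from the other N - 1.
    """
    if not (0 <= K <= N and 0 <= n <= N - 1):
        raise ValueError("need 0 <= K <= N and 0 <= n <= N - 1")
    # Target holds a rabbit with probability K / N; given that, the sample
    # must fall entirely among the N - K rabbit-free boxes.
    return (K / N) * (comb(N - K, n) / comb(N - 1, n))

def worst_case_prob_wrong(N, n):
    """Largest value of prob_wrong over every possible number of rabbits K."""
    return max(prob_wrong(N, K, n) for K in range(N + 1))
```

For a large number of boxes, the worst case behaves roughly like 1/(e n), so it shrinks as the sample grows, whatever principle was used in filling the boxes. The only ingredient is the randomization itself, which is the point of the essay.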

In the preceding chapters we discussed general conditions validating objective knowledge, 
from a constructivist epistemological perspective. In this chapter we discuss the 
use of randomization devices that can generate observable events with distributions that 
are independent of the distribution of any event relevant to a given statistical study. For 
example, the statistical study could be concerned with the reaction of human patients 
affected by a given disease to alternative medical treatments, whereas a "good" randomization 
device could be a generic 'coin flipping machine', like a regular die or a mechanical 
roulette borrowed from a casino. The randomization device could also be a sophisticated 
apparatus detecting flips (state transitions) in some quantum system, with transition 
probabilities known with a relative precision of one over a trillion. 

So far in this chapter we have seen how well decoupling strategies used in the 
DSEs, including randomization procedures, can help us to perform robust statistical inference 
and, in doing so, escape, from a pragmatic perspective, the solipsist burdens of an extreme 
subjectivist position. The same techniques can induce no association relations, generating 
sparse or structured statistical models. No association hypotheses can then be tested, 
confirming (or not) such sparse or structured patterns in the statistical model. 

3.8 Final Remarks 

As analyzed in this chapter, the randomization method, introduced by C.S.Peirce and 
J.Jastrow (1884), is the fundamental decoupling technique used in the design of statis- 
tical experiments (DSEs). Nevertheless, only after the work of R.A.Fisher (1935), were 
randomized designs used regularly in practice. Today, randomization is one of the basic 
backbones of statistical theory and methods. Meanwhile, the pioneering work of Peirce 
had been virtually forgotten by the Statistics community, until rediscovered by the histor- 
ical research of Stigler (1978) and Hacking (1988). Nevertheless, even today, the work of 
Peirce is presented as an isolated and ad hoc contribution. As briefly indicated in section 
5, it is plausible that Peirce and Jastrow's experimental and methodological work could 
have had motivations related to more general ideas of Peircean philosophy. In particular, 
we believe that the faint effects psychophysical hypothesis can be linked to the concept of 
synechism, while the randomized design solution can be embedded in the epistemological 
framework of Peirce's objective idealism. We believe that these topics deserve the 
attention of further research. 

In this chapter we have examined some aspects of DSEs, such as blocking, control 
and randomization, from an epistemological perspective. However, in many applications, 
most notably in medical studies, several other aspects have to be taken into account, 
including the well being of the patients taking part in the study. In our view, such complex 
situations require a thorough, open and honest discussion of all the moral and ethical 
aspects involved. Typically they also demand sound protocols and complex statistical 
models, suited to the fine quantitative analyses needed to balance multiple objectives 
and competing goals. For the Placebo, Nocebo, Kluger Hans, and similar effects, and 
the importance of blinding and randomization in clinical trials, see Kotz et al. (2005), 
under the entries Clinical Trials I, by N.E.Breslow, v. 2, p. 981-989, and Clinical Trials 
II, by R.Simon, v. 2, p. 989-998. For additional references on statistical randomization 
procedures, see Folks (1984), Kadane and Seidenfeld (1990), Kaptchuk and Kerr (2004), 
Karlowski et al. (1975), Kempthorne (1977, 1980), Noseworthy et al. (1994), Pfeffermann 

Chapter 4 



Metaphor and Metaphysics: The 
Subjective Side of Science 

"Why? - That is what my name asks! 

And there He blessed him. " 
Genesis, XXXII, 30. 

"Metaphor is perhaps one of man's most fruitful potentialities. 

Its efficacy verges on magic, and it seems a tool for creation 
which God forgot inside His creatures when He made them. " 
Jose Ortega y Gasset, The Dehumanization of Art, 1925. 

"There is nothing as practical as a good theory. " 
Attributed to Ludwig Boltzmann (1844-1906). 



4.1 Introduction 

In this chapter we proceed with the exploration of the Cognitive Constructivism epistemological 
framework (Cog-Con), continuing the work developed in previous 
chapters, and briefly reviewed in section 5. In the previous chapters, we analyzed questions 
concerning How objects (eigen-solutions) emerge, that is, how they (eigen-solutions) 
become known in the interaction processes of a system with its environment. These questions 
had to do with laws, patterns, etc., expressed as sharp or precise hypotheses, and 
we argued that statistical hypothesis testing plays an important role in their validation. 

It is then natural to ask - Why? Why are these objects the way they are, and why do they 
interact the way they do? Why-questions call for a causal nexus in a chain of events. 
Therefore, their answers must be theoretical constructs based on interpretations of the 
laws used to describe these events. This chapter is devoted to the investigation of these 
issues. Likewise, the interplay between the How and Why levels of inquiry which, in the 
constructivist perspective, are not neatly stratified in separate hierarchical layers, but 
interact in complex (often circular) patterns, will also be analyzed. As in the previous 
chapters, the discussion is illustrated by concrete mathematical models. In the process, 
we raise some interesting questions related to the practice of statistical modeling. 

Section 2 examines the dictum "Statistics is Prediction". The importance of accurate 
prediction is obvious for any statistics practitioner, but is that all there is? The investigation 
of the importance of model interpretability begins in section 3, where the rhetorical power 
of mathematical models, self-fulfilling prophecies and some related issues are discussed, 
and a practical consulting case in Finance, concerning the detection of trading opportunities 
for intraday operations in both the BOVESPA and BM&F financial markets, is 
presented. In this example, the REAL classification tree algorithm, a statistical technique 
presented in Lauretto et al. (1998), is used. 

Section 4 is devoted to the issue of language dependence. Therein, the investigation 
on model interpretability continues with an analysis of the eternal counterpointing issues 
of models for prediction and models for insight. An example from Psychology, concerning 
dimensional personality models is also presented. These models are based on a dimension 
reduction technique known as Factor Analysis. 

In section 6, the necessary or "only world" vs. optimal or "best world" formulations of 
optics and mechanics are discussed. Simple examples related to the calculus of variations 
are presented, which abridge the epistemological discussion in the following sections. Section 
7 discusses efficient and final causal relations, teleological explanations, necessary and 
best world arguments, and the possibility or desirability of having multiple interpretations 
for the same model or multiple models for the same phenomenon. In section 8, the form 
of modern metaphysical arguments in the construction of physical theories is addressed. 

In section 9, some simple but widely applicable models based on averages computed 
over all "possible worlds", or more specifically, path integrals over all possible trajectories 
of a system, are presented. The first example in this section relates to the linear system 
Monte Carlo solution to the Dirichlet problem, a technique driven by a stochastic process 
known as Gaussian Random Walk or Brownian Motion. Section 9 also points to a 
generalization of this process known as Fractional Brownian Motion. In sections 7 to 9 we 
also try to examine the interrelations between "only world", "best world" and "possible 
worlds" forms of explanation, as well as their role and purpose in the light of cognitive 
constructivism, since they are at the core of modern metaphysics. 

Section 10 discusses how hypothetical models, mathematical equations, etc., relate to 
the "true nature" of "real objects". The importance of this relationship in the history 
of science is illustrated therein with two cases: The Galileo affair, and the atomic or 
molecular hypothesis, as presented by L.Boltzmann, A.Einstein and J.Perrin. In section 
11 our final remarks are presented. 

All discussions in the paper are motivated with illustrative examples, and these examples 
follow an approximate order from soft to hard science. The example from Psychology 
presented in section 4, together with the corresponding Factor Analysis modeling technique, 
is at an intermediate point of this soft-hard scale, making it a natural place to 
pause, take a deep breath, and try to get a bird's eye view of the panorama. 
Section 5 reviews some concepts of Cog-Con ontology defined in previous chapters, and 
discusses some insights on Cog-Con metaphysics. 

4.2 Statistics is Prediction. Is that all there is? 

As a first example for discussion, we present a consulting case in finance. The goal of 
this project was to implement a model for the detection of trading opportunities for 
intraday operations in both the BOVESPA and the BM&F financial markets. For details 
we refer to Lauretto et al. (1998). The first algorithms implemented were based on 
Polynomial Networks, as presented in Farlow (1984) and Madala and Ivakhnenko (1994), 
combined with standard time series pre-processing analysis techniques such as de-trending, 
de-seasonalization, differencing, stabilization and linear transformation, as presented in 
Box and Jenkins (1976) and Brockwell and Davis (1991). A similar model is presented 
in Lauretto et al. (2009). The predictive power of the Polynomial Network model was 
considered good enough to render a profitable return / risk performance. 

According to decision theory, and its gambling metaphor as presented in 
section 1.5, the fundamental purpose of a statistical model is to help the user in a specific 
gambling operation, or decision problem. Hence, at least according to the orthodox 
Bayesian view, predictive power is the basic criterion to judge the quality of a statistical 
model. This conclusion is accepted with no reservations by most experts in decision 
theory, orthodox Bayesian epistemologists, and even by many general practitioners. As 
typical examples, consider the following statements: 

"We assume that the primary aim of [statistical] analysis is prediction. " 
Robert (1995, p.456). 

"Although association with theory is reassuring, it does not mean that a 
statistical fitted model is more true or more useful. All models should stand or 
fall based on their predictive power." Newman and Strojan (1998, p. 168). 

"The only useful function of a statistician is to make predictions, and thus 
to provide a basis for action." W.E.Deming, as quoted in W.A.Wallis (1980). 

"It is my contention that the ultimate aim of any statistical analysis is 
to forecast, and that this determines which techniques apply in particular
circumstances... The idea that statistics is all about making forecasts based on
probabilistic models of 'reality' provides a unified approach to the subject. In 
the literary sense, it provides a consistent authorial 'voice'... the underlying 
purpose, often implicit rather than explicit, of every statistical analysis is to 
forecast future values of a variable." A. L. McLean (1998). 

Few theaters of operation so closely resemble a real casino as the stock market, hence, 
we were convinced that our model would be a success. Unfortunately, our Polynomial 
Network model was not well accepted by the client, that is, it was seldom used for
actual trading. The main complaint was the model's lack of interpretability. The model 
was perceived as cryptic, a "black box" capable of selecting strategic operations and com- 
puting predicted margins and success rates, but incapable of providing an explanation 
of why the selection was recommended at that particular juncture. This state of affairs
was quite frustrating indeed: First, the client had never explicitly required such func- 
tionality during the specification stage of the project, hence the model was not conceived 
to provide explanatory statements. Second, as a fresh Ph.D. in Operations Research, I 
was well trained in the minutiae of Measure Theory and Hilbert Spaces, but had very 
little experience on how to make a model that could be easily interpreted by somebody 
else. Nevertheless, since (good) customers are always right, a second model was specified,
developed and implemented, as explained in the next section. 

4.3 Rhetoric and Self-Fulfilling Prophecies 

The first step in developing a new model for the problem presented in the last section was
to find out what the client meant by an interpretable model. After a few brainstorming
sessions with the client, we narrowed it down to two main conditions: understandable
I/O and understandable rules. The first condition (understandable I/O) called for the
model's input and output data to be already known, familiar or directly interpretable. The
second condition (understandable rules) called for the model's transformation functions,
re-presentation maps or derivation rules to be also based on already known, familiar or
directly interpretable principles. 

Technical Indicators, derived from pre-processed price and volume trading data, con- 
stituted the input to the second model. Further details on their nature will be given 
later in this section. For now, it is enough to know that they are widely used in financial 
markets, and that the client possessed ample expertise in technical analysis. The model's 
statistical data processing, on the other hand, was based on a classification tree algorithm 
specially developed for the application - the Real Attribute Learning Algorithm, or REAL, 
as presented in Lauretto et al. (1998). For general classification tree algorithms, we refer 
to Breiman (1993), Denison et al. (2002), Michie et al. (1994), Mueller and Wysotzki 
(1994), and Unger and Wysotzki (1981).

The REAL based model turned out to be very successful. In fact, statistically, it
performed almost as well as the Polynomial Network model, under the performance metric 
specified in Lauretto et al. (1998). Moreover, when combined with a final interpretive 
analysis and go-ahead decision from the traders, the REAL based model performed better 
than the Polynomial Network model. The model was finally put into actual use, once it 
was perceived as interpretable and understandable. Since a large part of our consulting 
fees depended on the results in actual trading, this was an important condition for getting 
fair economic compensation for all this intellectual endeavor.

As already mentioned, we were intrigued at the time (and still are) by many aspects 
related to model interpretation and understanding. In this section we begin to analyse 
this and other similar issues. Concerning first the very need for explanations: Humans 
seem to be always avid for explanations. They need them in order to carry out their 
deeds, and they want them to be based on already known schemata, as acknowledged in 
Damodaran (2003, ch. 7, p. 17):

"The Need for Anchors: When confronted with decisions, it is human na- 
ture to begin with the familiar and use it to make judgments. . . . 

The Power of the Story: For better or worse, human actions tend to be 
based not on quantitative factors but on story telling. People tend to look for 
simple reasons for their decisions, and will often base their decision on whether 
these reasons exist." 

The rhetorical purpose and power of statistical models have received, within the
statistical literature, only a small fraction of the attention that their importance in
consulting practice deserves. There are, nevertheless, some remarkable exceptions, as seen,
for example, in Abelson (1995, p. xiii):

"The purpose of statistics is to organize a useful argument from quanti- 
tative evidence, using a form of principled rhetoric. The word principled is 
crucial. Just because rhetoric is unavoidable, indeed acceptable, in statistical 
presentations does not mean that you should say anything you please. " 

"Beyond the rhetorical function, statistical analysis also has a narrative 
role. Meaningful research tells a story with some point to it, and statistics can 
sharpen the story. " 

Let us now turn our attention to the inputs to the REAL based model, the Technical 
Indicators, also known as Charting Patterns. For a general description, see Damodaran 
(2003, ch. 7). For some of the indicators used in the REAL project, see Colby (1988) and 
Murphy (1986). Technical indicators are primarily interpreted as behavioral patterns 
in the markets or, more appropriately, as behavioral patterns of the market players. 
Damodaran defines five groups that categorize the indicators according to the dominant
aspects of the behavioral pattern. A concise description of these five groups of indicators
is given in Damodaran (2003, ch. 7, p. 46-47):

1 - External Forces / Large Scale Indicators: "If you believe that there are 
long-term cycles in stock prices, your investment strategy may be driven by 
the cycle you subscribe to and where you believe you are in the cycle. " 

2 - Lead / Follow Indicators: "If you believe that there are some traders 
who trade ahead of the market, either because they have better analysis tools 
or information, your indicators will follow these traders - specialist short sales 
and insider buying/selling, for instance - with the objective of piggy-backing 
on their trades. " 

3 - Persistence / Momentum Indicators: "With momentum indicators, such 
as relative strength and trend lines, you are assuming that markets often learn 
slowly and that it takes time for prices to adjust to true values. " 

4 - Contrarian / Over Reaction Indicators: "Contrarian indicators such 
as mutual fund holdings or odd lot ratios, where you track what investors are 
buying and selling with the intention of doing the opposite, are grounded in 
the belief that markets over react. " 

5 - Change of Mind / Price-Value Volatility Indicators: "A number of technical
indicators are built on the presumption that investors often change their
views collectively, causing shifts in demand and prices, and that patterns in
charts - support and resistance lines, price relative to a moving average - can
predict these changes."

At this point, it is important to emphasize the dual nature of technical indicators: 
They disclose some things that may be happening with the trading market and also some 
things that may be happening with the traders themselves. In other words, they portray 
dynamical patterns of the market that reflect behavioral patterns of the traders. 

Two characteristics of the REAL based model, of vital importance to the success in 
the consulting case presented, relate to rhetorical and psychological aspects that have 
been commented on so far:

- Its good predictive and rhetorical power, which motivated the client to trade on the 
basis of the analyses provided by the model;

- The possibility of combining and integrating the analyses provided by the model 
with expert opinion. 

Technical indicators often carry the blame of being based on self-fulfilling prophecies,
over-simplified formulas, superficial and naive behavioral patterns, unsound economic
grounds, etc. From a pragmatic perspective, market analysts do not usually care about
the compatibility of technical analysis with sound economic theories, mathematical
sophistication, etc. Its ability to detect trading opportunities is what counts. From a conceptual
perspective, each of these analyses does tell a story about a cyclic reinforcement or correction
(positive or negative feed-back) mechanism in the financial system. What is peculiar
about self-fulfilling prophecies is that the collective story telling activity is a vital link in
the feed-back mechanism. It is not surprising then that the market players' perception
of how good the story itself looks should play an important role in foretelling whether
the prophecy will come true. From this perspective one can understand the statement in
Murphy (1986, p. 19):

"The self-fulfilling prophecy (argument) is generally listed as a criticism of 
charting. It might be more appropriate to label it as a compliment. " 

The importance of the psychological aspects of the models studied in this section
motivates us to take a look, in the sequel, at some psychological models of personality.

4.4 Language, Metaphor and Insight 

In chapter 1, the dual role played by Statistics in scientific research, namely, predicting 
experimental events and testing hypotheses, was pointed out. It was also emphasized that, 
under a constructivist perspective, these hypotheses are often expressed as equations of 
a mathematical model. In the last section we began to investigate the importance of the 
interpretability of these models. The main goal of this section is to further investigate 
subjective aspects of a statistical or mathematical model, specifically, the understanding 
or insight it provides. 

We start with three different versions of the well-known motto of Richard Hamming:

- "The purpose of models is insight, not numbers. " 

- "The purpose of computing is insight, not numbers. " 

- "The purpose of numbers is insight, not numbers. " 

Dictionary definitions of Insight include:

- A penetrating, deep or clear perception of a complex situation; 

- Grasping the inner or hidden nature of things; 

- An intuitive or sudden understanding. 

The illustrative case presented in this section is based on psychological models of 
personality. Many of these models rely on symmetric configurations known as "mandala" 
schemata, see for example Jung (1968), and a good example is provided by the five 
elements model of traditional Chinese alchemy and their associated personality traits: 

1- Fire: Extroverted, emotional, emphatic, self-aware, sociable, eloquent.

2- Earth: Caring, supporting, stable, protective, worried, attached.

3- Metal: Analytical, controlling, logical, meticulous, precise, zealous. 

4- Water: Anxious, deep, insecure, introspective, honest, nervous. 

5- Wood: Angry, assertive, creative, decisive, frustrated, leading. 

Interactions between elements are conceived as a double feed-back cycle, represented
by a pentagram inscribed in a pentagon. The pentagon or external cycle represents the
creation, stimulus or positive feed-back in the system, while the pentagram or internal
cycle represents the destruction, control or negative feed-back in the system. The traditional
representations of these systemic generative mechanisms or causal relations are:

Pentagon: fire (calcinates to) earth (harbors) metal (condenses) water (nourishes)
wood (fuels) fire.

Pentagram: fire (melts) metal (cuts) wood (incorporates) earth (absorbs) water
(extinguishes) fire.

This double feed-back structure allows the representation of systems with complex
interconnections and nontrivial dynamical properties. In fact, the systemic interconnections
are considered the key for understanding a general five-element model, rather than any 
superficial analogy with the five elements' traditional labels. 
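
The systemic interconnections can be made explicit by writing the two cycles as directed graphs on the same five nodes. The Python sketch below is only an illustration; the names `PENTAGON`, `PENTAGRAM` and `relations` are hypothetical, and the edge labels (calcinates, melts, etc.) are dropped for brevity.

```python
from itertools import combinations

ELEMENTS = ["fire", "earth", "metal", "water", "wood"]


def cycle_edges(order):
    """Directed edges (source, target) of the closed cycle visiting `order`."""
    n = len(order)
    return {(order[i], order[(i + 1) % n]) for i in range(n)}


# External cycle (pentagon): creation, stimulus, positive feed-back.
PENTAGON = cycle_edges(ELEMENTS)
# Internal cycle (pentagram): destruction, control, negative feed-back.
PENTAGRAM = cycle_edges(["fire", "metal", "wood", "earth", "water"])


def relations(element):
    """Summarize the four links of `element` in the double feed-back cycle."""
    return {
        "stimulates": next(t for s, t in PENTAGON if s == element),
        "stimulated_by": next(s for s, t in PENTAGON if t == element),
        "controls": next(t for s, t in PENTAGRAM if s == element),
        "controlled_by": next(s for s, t in PENTAGRAM if t == element),
    }


# Every pair of distinct elements is linked by exactly one of the two cycles.
linked = {frozenset(edge) for edge in PENTAGON | PENTAGRAM}
all_pairs = {frozenset(pair) for pair in combinations(ELEMENTS, 2)}
```

Seen this way, the pentagram is the pentagon traversed in steps of two, and the ten edges of the two cycles together connect every pair of elements. The structure therefore does not depend on the traditional labels attached to the nodes.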

It is an entertaining exercise to compare and relate the five alchemical elements listed 
above with the five groups of technical indicators presented in the last section, or with the 
big-five personality factors presented next, even if some of these models are considered 
pre-scientific. Why, for example, do these models employ exactly five factors? That 
is, why is it that "four are few and six are many"? Is there an implicit mechanism in 
the model, see Hargittai (1992), Hotchkiss (1998) or Philips (1995, ch.2), or is this an 
empirical statement supported by research data? 

Scientific psychometric models must be based on solid statistical analysis of testable 
hypotheses. Factor Analysis has been one of the preferred techniques used in the con- 
struction of modern psychometric models and it is the one used in the examples we discuss 
next. In section C.5, the basic structure of factor analysis statistical models is reviewed. 

In Allport and Odbert (1936) the authors presented their Lexical Hypothesis. Ac- 
cording to them, important aspects of human life correspond to words in the spoken 
language. Also the number of corresponding terms in the lexicon is supposed to reflect 
the importance of each aspect: 

"Those individual differences that are most salient and socially relevant 
in people's lives will eventually become encoded into their language; the more
important such a difference, the more likely is it to become expressed as a
single word."

One of the most widely used factor models takes into account five factors or personality
traits. These are the five dimensions of the "OCEAN" personality model or "big-five".
Further details on the meaning of these factors can be found in Shedler and Westen
(2004), from the list of the most relevant factor loadings. The "OCEAN" labels, ordered 
according to their statistical relevance, are: 

1- Extraversion, Energy, Enthusiasm; 

2- Agreeableness, Altruism, Affection; 

3- Conscientiousness, Control, Constraint; 

4- Neuroticism, Negative Affectivity, Nervousness; 

5- Openness, Originality, Open-mindedness. 

Subsequent studies pointed to the "existence" of more factors; for a review of several
such models, see Widiger and Simonsen (2005). Herein, we focus our attention on
the 12-factor model of Shedler and Westen. We remark, however, that the publication
of the 12-factor model ignited a heated debate in the literature concerning the necessity (or
not) of more than 5 factors. In the quotation below, Shedler and Westen (2004, p. 1752-1753)
pinpoint the issue of language dependence in the description of reality, an issue of
paramount importance in cognitive constructivism and one of the main topics analyzed
in this section.

"Applying the Lexical Hypothesis to Personality Disorders: 
Ultimately, the five-factor model is a model of personality derived from the 
constructs and observations of lay-people, and it provides an excellent map 
of the domains of personality to which the average layperson attends. How- 
ever, the present findings suggest that the five-factor model is not sufficiently 
comprehensive for describing personality disorders or sophisticated enough for 
clinical purposes. 

In contrast to laypeople, practicing clinicians devote their professional lives 
to understanding the intricacies of personality. They develop intimate knowledge
of others' lives and inner experience in ways that may not be possible in
everyday social interaction. Moreover, they treat patients with variants of personality
pathology that laypeople encounter only infrequently (and are likely to
avoid when they do encounter it). One would therefore expect expert clinicians 
to develop constructs more differentiated than those of lay observers. 

Indeed, if this were not true, it would violate the lexical hypothesis on which 
the five-factor model rests: that language evolves over time to reflect what is 
important. To the extent that mental health professionals observe personality 
with particular goals and expertise, and observe the more pathological end of 
the personality spectrum, the constructs they consider important should differ 
from those of the average layperson."

The issue of language dependence is very important in cognitive constructivism. For
further discussion, see Maturana (1988, 1991). Thus far we have stressed the lexical 
aspect of language, that is, the importance of the available vocabulary in our description 
of reality. In the remaining part of this section we shall focus on the symbolic or figurative 
use of the language constructs in these descriptions. We proceed by examining in more 
detail the factor analysis model. 

Factor analysis is a dimension reduction technique. Its application renders a 'simple'
object, the factor model, capable of efficiently "coding", into a space of reduced dimension,
a complex 'real' object from a full or high dimensional space. In other words, a dimension
reduction technique presumes some form of valid knowledge transference, back and forth,
between the complex (high dimensional) object and its simple (low dimensional) model. Hence, the
process of using and interpreting factor analysis models can be conceived as metaphorical. 
Recall that the Greek word metaphor stands for transport or transfer, so that a linguistic 
metaphor transfers some of the characteristics of one object, called the source or vehicle, 
into a second distinct object, called the target, tenant or topic; for a comprehensive 
reference see Lakoff and Johnson (2003). 

For reasons which are similar to those studied in the last section, most users of a 
personality model require it to be statistically sound. Many of them further demand it 
to be interpretable, in order to provide good insights into their patients' personalities and
problems. A good model should not only be useful in predicting recovery rates or drug 
effectiveness, but also help in supplying good counseling or therapeutics. 

Paraphrasing Vega-Rodriguez (1998):

The metaphorical mechanism should provide an articulation point between the empir- 
ical and the hypothetical, the rational and the intuitive, between calculation and insight. 

The main reason for choosing factor analysis to illustrate this section is its capabil- 
ity of efficiently and transparently building sound statistical models that, at the same 
time, provide intuitive interpretations. While soundness is the result of "estimation and 
identification tools", such as ML (maximum likelihood) or MAP (maximum a posteriori)
optimization, hypothesis testing and model selection, interpretability results from
"representation tools", such as orthogonal and oblique factor rotation techniques.

Factor rotation tools are meant to reconfigure the structure of a given factor anal- 
ysis model, so as to maintain its probabilistic explanatory power while maximizing its 
heuristic explanatory power. Factor rotations are performed to implement an objective
optimization criterion, such as sparsity or entropy maximization. The optimal solution (for
each criterion) is unique and is hoped to enhance model interpretability a great deal.
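
For two factors, an orthogonal rotation is just a turn of the loading axes by an angle, which makes the idea easy to sketch. The Python fragment below uses hypothetical loadings and a crude grid search standing in for a proper optimizer. As one possible sparsity-type criterion it takes the varimax criterion (variance of the squared loadings); this choice is an assumption made for illustration, not a method prescribed in the text. The fragment checks that rotation leaves the common covariance part, Lambda Lambda', unchanged while not decreasing the criterion.

```python
import math


def rotate(loadings, theta):
    """Apply the orthogonal rotation by angle theta to a p x 2 loading matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[a * c - b * s, a * s + b * c] for a, b in loadings]


def common_covariance(loadings):
    """Lambda Lambda': the covariance part explained by the common factors."""
    return [[sum(x * y for x, y in zip(ri, rj)) for rj in loadings]
            for ri in loadings]


def varimax_criterion(loadings):
    """Sum, over the two factors, of the variance of the squared loadings."""
    p = len(loadings)
    total = 0.0
    for j in range(2):
        squares = [row[j] ** 2 for row in loadings]
        mean = sum(squares) / p
        total += sum((v - mean) ** 2 for v in squares) / p
    return total


# Hypothetical loadings: a near-simple structure, deliberately blurred by a rotation.
simple = [[0.9, 0.0], [0.8, 0.0], [0.7, 0.1], [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]]
blurred = rotate(simple, math.pi / 5)

# Crude grid search over [0, pi/2); theta = 0 is included, so the selected
# rotation can never score worse than the unrotated loadings.
grid = [k * (math.pi / 2) / 900 for k in range(900)]
best_theta = max(grid, key=lambda th: varimax_criterion(rotate(blurred, th)))
rotated = rotate(blurred, best_theta)
```

The probabilistic content of the model (the implied covariance) is the same before and after rotation; only the representation changes, which is why rotation can be used freely to look for a more interpretable loading pattern.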

4.5 Constructive Ontology and Metaphysics

How important are heuristic arguments in other areas of science? Should statistical or
mathematical models play a similar rhetorical role in other fields of application? We will 
try to answer these questions by discussing the role played by similar heuristic arguments 
in physics. In sections 2 and 3 we dealt with application areas in which text(ure) manu- 
facture comprised, to a great extent, the very spinning of the threads. Nevertheless, one 
can have the false impression that the constructivist approach is better suited to high level, soft
science areas, rather than low level, rock bottom Physics. This widespread impression
is certainly a misconception. In sections 7 through 10 we analyze the role played in
science by metaphysics, a very special form of heuristic argumentation. 

The example presented in section 4, together with the corresponding Factor Analysis 
modeling technique, is at an intermediate point of the soft-hard science scale used herein to 
(approximately) order the examples. Therefore, as previously stated in the introduction, 
we shall use the current section to pause in the exposition, take a deep
breath, and try to get a bird's eye view of the scenario. This section also reviews some
concepts of Cog-Con ontology defined in previous chapters and discusses some insights
on Cog-Con metaphysics. 

The Cog-Con framework rests upon two basic metaphors: Heinz von Foerster's
metaphor of Object as token for an eigensolution, which is the key to Cog-Con ontology,
and Humberto Maturana and Francisco Varela's metaphor of Autopoiesis and cognition,
the key to Cog-Con metaphysics. Below we review these two metaphors, as they
were used in chapter 1.

Autopoiesis and Cognition 

Autopoietic systems are non-equilibrium (dissipative) dynamical systems exhibiting (meta)stable
structures, whose organization remains invariant over (long periods of) time, despite
the frequent substitution of their components. Moreover, these components are produced 
by the same structures they regenerate. As an example, take the macromolecular pop- 
ulation of a single cell, which can be renewed thousands of times during its lifetime, see 
Bertalanffy (1969). However, in spite of the fact that autopoiesis was a metaphor devel- 
oped to suit the essential characteristics of organic life, the concept of autopoietic system 
has been applied in the analysis of many other concrete or abstract autonomous systems 
such as social systems and corporate organizations, see for example Luhmann (1989) and 
Zeleny (1980).

The regeneration processes in the autopoietic system production network require the 
acquisition of resources such as new materials, energy and neg-entropy (order), from the
system's environment. Efficient acquisition of the needed resources demands selective
(inter)actions which, in turn, must be based on suitable inferential processes (predictions).
Moreover, these inferential processes characterize the agent's domain of interaction as a
cognitive domain. For more details see the comments in chapter 1 and, more importantly,
the original statements in Maturana and Varela (1980, p. 10): 

"The circularity of their organization continuously brings them back to the 
same internal state (same with respect to the cyclic process). ... Thus the 
circular organization implies the prediction that an interaction that took place 
once will take place again. ... Accordingly, the predictions implied in the 
organization of the living system are not predictions of particular events, but 
of classes of interactions. ... This makes living systems, inferential systems,
and their domain of interactions a cognitive domain." 

Objects as Tokens for Eigen-Solutions

The circular (cyclic or recursive) characteristic of autopoietic regenerative processes and 
their eigen (auto, equilibrium, fixed, homeostatic, invariant, recurrent, recursive) -states, 
both in concrete and abstract autopoietic systems, are investigated in Foerster (2003) and 
Segal (2001). 

"The meaning of recursion is to run through one's own path again. One of 
its results is that under certain conditions there exist indeed solutions which, 
when reentered into the formalism, produce again the same solution. These 
are called "eigen-values", "eigen-functions", "eigen-behaviors", etc., depending
on which domain this formation is applied - in the domain of numbers, in
functions, in behaviors, etc." Segal (2001, p. 145). 
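
A minimal numerical sketch of this idea, assuming the cosine function as a stand-in operator (a hypothetical choice made only for illustration): re-entering the output of the operator as its next input settles, from quite different starting points, on the same value, which the operator then reproduces. The function name `eigen_value` and the tolerance are likewise illustrative assumptions.

```python
import math


def eigen_value(operator, start, tol=1e-12, max_iter=10_000):
    """Re-enter the operator's output as its input until it reproduces itself."""
    x = start
    for _ in range(max_iter):
        nxt = operator(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    raise RuntimeError("recursion did not settle within max_iter steps")


# Three different starting points for the same recursion x -> cos(x).
solutions = [eigen_value(math.cos, start) for start in (-3.0, 0.0, 10.0)]
```

All three runs end at the same number, close to 0.739, which satisfies x = cos(x). In this toy setting the value is discrete (an isolated point) and stable (the starting point does not matter).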

The concept of eigen-solution for an autopoietic system is the key to distinguish specific 
objects in a cognitive domain. 

"Objects are tokens for eigen-behaviors. Tokens stand for something else. In 
exchange for money (a token itself for gold held by one's government, but 
unfortunately no longer redeemable), tokens are used to gain admittance to 
the subway or to play pinball machines. In the cognitive realm, objects are the 
token names we give to our eigen-behavior. ... When you speak about a ball, 
you are talking about the experience arising from your recursive sensorimotor 
behavior when interacting with that something you call a ball. The "ball" 
as object becomes a token in our experience and language for that behavior 
which you know how to do when you handle a ball. This is the constructivist's 
insight into what takes place when we talk about our experience with objects." 
Segal (2001, p. 127).

Furthermore, von Foerster establishes four essential attributes of eigen-solutions:

"Eigenvalues have been found ontologically to be discrete, stable, separable 
and composable, while ontogenetically to arise as equilibria that determine 
themselves through circular processes. Ontologically, Eigenvalues and objects, 
and likewise, ontogenetically, stable behavior and the manifestation of a sub- 
ject's "grasp" of an object cannot be distinguished." Foerster (2003, p. 266). 

Constructive Ontology 

The Cog-Con framework also includes the following conception of reality and some related 
terms, as defined in chapter 2: 

1. Known (knowable) Object: An actual (potential) eigen-solution of a given 
system's interaction with its environment. In the sequel, we may use a some- 
what more friendly terminology by simply using the term Object. 

2. Objective (how, less, more): Degree of conformance of an object to the 
essential attributes of an eigen-solution (to be precise, stable, separable and 
composable).

3. Reality: A (maximal) set of objects, as recognized by a given system, when 
interacting with single objects or with compositions of objects in that set. 

The Cog-Con framework assumes that an object is always observed by an observer, just 
like a living organism or a more abstract system, interacting with its environment. There- 
fore, this framework asserts that the manifestation of the corresponding eigen-solution and 
the properties of the object are respectively driven and specified by both the system and 
its environment. More concisely, Cog-Con sustains: 

4. Idealism: The belief that a system's knowledge of an object is always
dependent on the system's autopoietic relations.

5. Realism: The belief that a system's knowledge of an object is always
dependent on the environment's constraints. 

Consequently, the Cog-Con perspective requires a fine equilibrium, called Realistic or
Objective Idealism. Solipsism and Skepticism are symptoms of epistemological analyses
that lose the proper balance by putting too much weight on the idealistic side. Conversely,
Dogmatic Realism is a symptom of epistemological analyses that lose the
proper balance by putting too much weight on the realistic side. Dogmatic realism has
been, from the Cog-Con perspective, a very common (but mistaken) position in modern
epistemology. Therefore, it is useful to have a specific expression, namely, something in
itself, to be used as a marker or label for such ill posed dogmatic statements. The method
used to access something in itself is often described as: - Something that an observer
would observe if the (same) observer did not exist, or - Something that an observer could
observe if he made no observations, or - Something that an observer should observe in the
environment without interacting with it (or disturbing it in any way), and many other
equally senseless variations.

Figure 1: Scientific production diagram.

Although the application of the Cog-Con framework is as general as that of autopoiesis, 
this paper is focused on scientific activities. The interpretation of scientific knowledge as 
an eigensolution of a research process is part of a Cog-Con approach to epistemology. Figure 1
presents an idealized structure and dynamics of knowledge production, see Krohn
and Küppers (1990) and chapters 1 and 6. The diagram represents, on the Experiment
side (left column) the laboratory or field operations of an empirical science, where ex- 
periments are designed and built, observable effects are generated and measured, and 
an experimental data bank is assembled. On the Theory side (right column), the dia- 
gram represents the theoretical work of statistical analysis, interpretation and (hopefully) 
understanding according to accepted patterns. If necessary, new hypotheses (including 
whole new theories) are formulated, motivating the design of new experiments. Theory 
and experimentation constitute a double feed-back cycle making it clear that the design 
of experiments is guided by the existing theory and its interpretation, which, in turn, 
must be constantly checked, adapted or modified in order to cope with the observed 
experiments. The whole system constitutes an autopoietic unit.

Fact or Fiction?

At this point it is useful to (re)turn our attention to a specific model, namely, factor anal- 
ysis, as discussed in section 4, and consider the following questions raised by Brian Everitt 
(1984, p. 92, emphases are ours) concerning the appropriate interpretation of factors:

"Latent variables - fact or fiction? One of the major criticisms of factor
analysis has been the tendency for investigators to give names to factors, and 
subsequently, to imply that these factors have a reality of their own over 
and above the manifest variables. This tendency continues with the use 
of the term latent variables since it suggests that they are existing variables 
and that there is simply a problem of how they should be measured. In 
truth, of course, latent variables will never be anything more than is contained 
in the observed variables and will never be anything beyond what has been 
specified in the model. For example, in the statement that verbal ability is 
whatever certain tests have in common, the empirical meaning is nothing more
than a shorthand for the observations of the correlations. It does not mean 
that verbal ability is a variable that is measurable in any manifest sense. 
However, the concept of latent variable may still be extremely helpful. A 
scientist may have a number of hypothetical constructs in terms of which some 
theory is formulated, and he is willing to assume that the latent variables used 
in specifying the structural models of interest are the operational equivalents
to theoretical constructs. As long as it is remembered that in most cases there 
is no empirical way to prove this correspondence, then such an approach can 
lead to interesting and informative theoretical insights." 

Ontology is a term used in philosophy in reference to a systematic account of existence 
or reality. We have already established the Cog-Con approach to objects as tokens for 
eigen-solutions, and explained their four essential attributes, namely, discreteness (preciseness,
sharpness or exactness), stability, separability and composability. Therefore, in
the Cog-Con framework, assessing the ontological status of an object, that is, saying how
objective it is, amounts to ascertaining how well it manifests the four essential attributes of an
eigen-solution.

The Full Bayesian Significance Test, or FBST, is a possibilistic belief calculus, based 
on (posterior) probabilistic measures, that was conceived as a statistical significance test 
to assess the objectivity of an eigen-solution, that is, to measure how well a given object 
manifests or conforms to von Foerster's four essential attributes. The FBST belief or 
credal value, ev(H | X), the e-value of hypothesis H given the observed data X, is 
interpreted as the epistemic value of hypothesis H (given X), or the evidence value of data X 
(supporting H). The formal definition of the FBST and several of its implementations in 
specific problems can be found in the author's previous publications, and are reviewed in 
appendix A. 

Greek or Latin? Latent or Manifest? 

We have already discussed the ontological status of an object. This discussion assumes 
testing hypotheses in a statistical model. In order to build such a model, one must know how 
to distinguish concrete measurable entities from abstract concepts, observed values from 
model parameters, latent from manifest variables, etc. When designing and conducting an 
experiment, a scientist must have a well-defined statistical model, and keep these 
distinctions crisp and clear. This is so important in the experimental sciences that statisticians 
have the habit of using Latin letters for observables, and Greek letters for parameters. 
When a statistician questions whether a letter is Latin or Greek, he or she is not asking 
for help with foreign alphabets, but rather seeking information about the aforementioned 
distinctions. 

According to the positivist philosophical school, measurable entities, observed values, 
manifest variables, etc. are the true, first class entities of a hard science, while abstract 
concepts, model parameters, latent variables, etc. should be considered second class 
entities. One reason for downgrading the latter class is that the positivist school assumes 
a nominalist perspective. Nominalism (at least in its strictest form) considers abstract 
concepts as mere names (nomina), that may stand as proxy for a "really existing item", 
denoting "that singular thing" (supponere pro illa re singulari). The Cog-Con perspective 
plays no role in the positivist dream. This issue will be further investigated in the next 
sub-section, as well as in sections 8 and 11. For now we offer the following argument: 

Although for a given model, the aforementioned distinctions between what to write 
using Latin or Greek letters should be always crisp and clear, we may have to simul- 
taneously work with several models. For example, we may need to use several models 
hierarchically organized to cope with phenomena at different scales or levels of granu- 
larity, like models in physics, chemistry, biology, and psychology, see chapters 5 and 6. 
We may also need different models for competing theories trying to explain a given phe- 
nomenon. Finally, we may need different models providing equivalent or compatible laws 
to given phenomena that, nevertheless, use distinct theoretical approaches, see sections 8, 
9 and 10. The positivist dream quickly turns into a nightmare when one realizes that an 
entity corresponding to a Greek letter variable in one model corresponds to a Latin letter 
variable in another, and vice-versa. 

It is also important to realize that in the Cog-Con approach the ontological status of 
an object is a reference to the properties of the corresponding eigen-solution emerging in 
a cyclic process. This leads to an intrinsically dynamic approach to ontology, in sharp 
contrast with other analyses based on static categories. A consequence of this dynamical 
setting is that in the Cog-Con approach a statement about the ontological status of 
a single element or isolated component in a process is an indirect reference to its role 
in the emergence of the corresponding eigen-solution. Equivalent or similar elements 
may play very different roles in distinct processes. Such distinct or multiple roles will 
not pose conceptual difficulties to the Cog-Con framework as long as the corresponding 
(statistical) models are clearly stated and well defined. For interesting examples of this 
situation, typical of modular and hierarchical architectures, hypercyclical organization, 
and emergent properties, see chapters 5 and 6. 



Constructive Metaphysics 

Metaphysics, in its gnosiological sense, is a philosophical term we use to refer to a sys- 
tematic account of possible forms of understanding, valid forms of explanation or rational 
principles of intelligibility. In science, such explanations are often well represented in a 
schematic diagram describing the organization of a conceptual network. A link in such a 
diagram expresses a theoretical relation like, for example, a causal nexus, that is, a cause 
and effect relation. In modern science, such explanations must also include the symbolic 
derivation of scientific hypotheses from general scientific laws, the formulation of new laws 
in an existing theory, and even the conception of new theories, as well as their general 
understanding based on general metaphysical principles. 

In this context, it is natural to ask questions like: What do we mean by the intuitive 
quality or theoretical importance of a concept or, more generally, of a sub-network? How 
interesting are the insights we gain from it? How can we assess its explanatory power or 
heuristic value? We will try to answer these questions in the following sections, most 
especially in section 8, on modern metaphysics. In this section we provide only a preliminary 
discussion of the importance of metaphysical entities in the constructivist perspective. 

We now return to Humberto Maturana and Francisco Varela's metaphor of autopoiesis 
and cognition. As stated at the beginning of this section, this metaphor is the key for 
Cog-Con metaphysics. From details of this metaphor we conclude that the autopoietic relations 
of a system not only define who or what it "is", but also limit the class of interactions in 
which it can possibly engage or the class of events it can possibly perceive. An adaptive 
system can learn, that is, it can reconfigure its internal organization, reshape its architec- 
ture, in order to enlarge its scope of inference or make better predictions. Nevertheless, 
learning is an evolutive process, and any evolutionary path to the future has to progress 
from the system's present (or initial) configuration. From the above considerations it is 
clear that, from a constructivist perspective, the specification of autopoietic relations is 
of vital importance since they literally define the scope and possibilities of the system's 
life. 
Theoretical Insights 

Cog-Con approaches science as an autopoietic system whose organization is coded by 
symbolic laws, causal relations, and metaphysical principles. Consequently, we must 
give them the greatest importance. Nevertheless, such metaphysical entities are even 
more abstract than the latent variables discussed in the last subsection. In contrast with 
the constructivist approach, the positivist school is thus quite hostile to metaphysical 
concepts. 

In the Cog-Con perspective, metaphysics provides meaning to objects in a given 
reality, explaining why the corresponding eigen-solutions manifest themselves the way they 
do. Accordingly, theoretical concepts become building blocks in the coding of systemic 
knowledge and reference marks in the mapping of the system's environment. Conceptual 
relations are translated into inference tools, thus becoming, by definition, the basis of 
autopoietic cognition. In the Cog-Con perspective, better understanding will strengthen 
a given theoretical architecture or entail its evolution. In so doing, the importance of the 
pertinent concepts is enhanced, their scope is enlarged and their utility increased. The 
whole process enables richer and wider connections in the web of knowledge, embedding 
theory even deeper in the system's life, revealing more links in the great chain of being! 

4.6 Necessary and Best Worlds 

In sections 7 through 10 we analyze the role played in modern science by metaphysics, 
a very special form of heuristic argumentation. Such arguments often explain why a 
system follows a given trajectory or evolves along a given path. These arguments may 
explain why a system must follow a necessary path or is effectively forced along a single 
trajectory; these are "only world" explanations. Teleological arguments explain why a 
system chooses the best trajectory according to some optimality criterion; these are "best 
world" explanations. Stochastic or integral arguments explain why the system's evolution 
takes into account (by including, averaging, summing or integrating over) all possible or 
admissible trajectories; these are "possible worlds" explanations. 

In sections 7 to 9 we also try to examine the interrelations between "only world", "best 
world" and "possible worlds" forms of explanation, as well as their role and purpose in 
the light of cognitive constructivism, since they are at the core of modern metaphysics. 
We begin this journey by studying in this section a simple and seemingly innocent mathe- 
matical puzzle. The puzzle, which will be solved directly by elementary calculus, is in fact 
used by Richard Feynman as an allegory to present an important variational problem. 

Consider a beach with shore line represented by $x = a$, in the standard Cartesian 
plane. A lifeguard, at position $(x, y) = (0, 0)$, spots a person drowning at position 
$(x, y) = (a + b, d)$. While on the athletic track the lifeguard car can run at top speed $c$, on the 
sand it can run at speed $c/\nu_1$. Once in the water, the lifeguard can only swim at speed 
$c/\nu_2$, $1 < \nu_1 < \nu_2$. Letting $(x, y) = (a, y(a))$ be the point where he enters the water, what 
is the optimal value $y(a) = z$ if he wants to reach position $(a + b, d)$ as fast as possible? 

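Since the shortest path in a homogeneous medium is a straight line, the optimal 
trajectory is a broken line, from $(0, 0)$ to $(a, z)$, and then from $(a, z)$ to $(a + b, d)$. The 
total travel time is $J(z)/c$, where 

$$ J(z) = \nu_1 \sqrt{a^2 + z^2} + \nu_2 \sqrt{b^2 + (d - z)^2} . $$

Since we want $J(z)$ at a minimum, we set 

$$ \frac{dJ}{dz} = \nu_1 \frac{2z}{2\sqrt{a^2 + z^2}} + \nu_2 \frac{-2(d - z)}{2\sqrt{b^2 + (d - z)^2}} = 0 , $$

so that we should have 

$$ \nu_1 \sin(\theta_1) = \nu_2 \sin(\theta_2) , $$

where $\theta_1$ and $\theta_2$ are the angles that the sand and water legs of the path make with 
the normal to the shore line, that is, $\sin(\theta_1) = z / \sqrt{a^2 + z^2}$ and 
$\sin(\theta_2) = (d - z) / \sqrt{b^2 + (d - z)^2}$. 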
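
The stationarity condition above can also be checked numerically. The following is a minimal sketch, not part of the original argument: it assumes the cost J(z) is the sum of the two leg lengths weighted by the resistance indices, finds the root of dJ/dz by bisection, and compares the two sides of the sine relation. The function names and the numerical values of a, b, d and the two indices are hypothetical, chosen only for illustration.

```python
import math


def travel_cost(z, a, b, d, nu1, nu2):
    """J(z): travel time multiplied by c for the broken line through (a, z)."""
    return nu1 * math.hypot(a, z) + nu2 * math.hypot(b, d - z)


def cost_derivative(z, a, b, d, nu1, nu2):
    """dJ/dz = nu1 * sin(theta1) - nu2 * sin(theta2)."""
    return nu1 * z / math.hypot(a, z) - nu2 * (d - z) / math.hypot(b, d - z)


def optimal_entry_point(a, b, d, nu1, nu2, tol=1e-12):
    """Bisection on dJ/dz over [0, d].

    J is convex in z, so dJ/dz is increasing and its root is the minimum.
    """
    lo, hi = 0.0, d
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cost_derivative(mid, a, b, d, nu1, nu2) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical values (assumptions, not taken from the text).
a, b, d, nu1, nu2 = 30.0, 20.0, 40.0, 1.5, 4.0

z = optimal_entry_point(a, b, d, nu1, nu2)
sin1 = z / math.hypot(a, z)
sin2 = (d - z) / math.hypot(b, d - z)

print("entry point z      :", z)
print("nu1 * sin(theta1)  :", nu1 * sin1)
print("nu2 * sin(theta2)  :", nu2 * sin2)
```

With a slower second medium the entry point moves beyond the straight-line crossing, so that a longer stretch is covered in the faster medium.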

Professional lifeguards claim that this simple model can be improved by dividing the 
sand into a dry band, $V_1$, and a wet band, $V_2$, and the water into a shallow band, $V_3$, and a 
deep band, $V_4$, with respective different media 'resistance' indices, $\nu_1, \nu_2, \nu_3, \nu_4$, satisfying 
$\nu_4 > \nu_3 > \nu_1 > \nu_2 > 1$. Although the solution for the improved model can be similarly 
obtained, a general formalism to solve 'variational' problems of this kind exists, which is 
known as the Euler-Lagrange equation. For an instructive introduction see Krasnov et al. 
(1973), Leech (1963) and Marion (1970). 
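
As an illustration of how the improved model "can be similarly obtained", the sketch below treats the four bands as parallel strips and applies the two-media condition at every interface, so that the product of the resistance index and the sine of the angle to the shore normal takes the same value K in all bands. It does not use the Euler-Lagrange formalism; it simply finds K by bisection so that the path ends at the required offset d. The band widths, the index values and all function names are assumptions made for this example only.

```python
import math


def shore_offset(K, widths, indices):
    """Displacement along the shore of a path with nu_i * sin(theta_i) = K."""
    total = 0.0
    for w, nu in zip(widths, indices):
        s = K / nu                                # sin(theta_i) in band i
        total += w * s / math.sqrt(1.0 - s * s)   # w * tan(theta_i)
    return total


def path_cost(K, widths, indices):
    """Travel time multiplied by c: sum of nu_i times the leg length in band i."""
    return sum(nu * w / math.sqrt(1.0 - (K / nu) ** 2)
               for w, nu in zip(widths, indices))


def solve_invariant(d, widths, indices, tol=1e-12):
    """Bisection on K in [0, min(nu_i)); shore_offset is increasing in K."""
    lo, hi = 0.0, min(indices)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if shore_offset(mid, widths, indices) < d:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical bands: dry sand, wet sand, shallow water, deep water.
widths = [20.0, 10.0, 5.0, 15.0]     # assumed band widths
indices = [2.0, 1.5, 3.0, 5.0]       # nu_1..nu_4, with nu_4 > nu_3 > nu_1 > nu_2 > 1
d = 40.0                             # assumed offset of the swimmer along the shore

K = solve_invariant(d, widths, indices)

# Cost of the naive straight line from the lifeguard to the swimmer.
slope = d / sum(widths)
straight_cost = sum(nu * w * math.sqrt(1.0 + slope ** 2)
                    for w, nu in zip(widths, indices))

print("invariant K         :", K)
print("cost of bent path   :", path_cost(K, widths, indices))
print("cost of straight run:", straight_cost)
```

The bent path spends more of its length in the bands with small indices, which is why its cost falls below that of the straight run.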

The trigonometric relation, $\nu(x) \sin(\theta) = K$, obtained in the last equation, is known in 
optics as Snell-Descartes' law. It explains the refraction (bending) of a light ray incident to 
a surface separating two distinct optic media. In this relation, $\nu$ is the medium refraction 
index. The variational problem solved above was proposed by Pierre de Fermat in 1662 to 
'explain' Snell-Descartes' law. Fermat's principle of least time states that a ray of light, 
going from one point to another, follows the path which is traversed in the smallest time. 

Notice that Fermat stated this principle before any measurement of the speed of 
light. The first quantitative estimate of the speed of light, in sidereal space, was obtained 
by O. Roemer in 1676. He measured the Doppler effect on the period of Io, a satellite 
of Jupiter discovered by Galileo in 1610. More precisely, he measured the violet and 
red shifts, i.e., the shortening and lengthening of the observed periods of Io, as 
the Earth traveled in its orbit towards and away from Jupiter. Roemer's final estimate 
was c = 1 au/11', that is, one astronomical unit (the length of the semi-major axis of 
the earth's elliptical orbit around the sun, approximately 150 million kilometres) per 
11 minutes. Today's value is around 1 au/8'20". The first direct measurements of the 
comparative speed of light in distinct material media (air and water) were obtained by 
Léon J. B. Foucault, almost two centuries later, in 1850, using a rotating mirror device. 
For details, see Tobin (1993) and Jaffe (1960). For a historical perspective of several 
competing theories of light we refer to Ronchi (1970) and Sabra (1981). 
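
For readers who prefer kilometres per second, the two estimates quoted above can be converted with a few lines of arithmetic. This is only a sketch using the rounded value of 150 million kilometres for the astronomical unit given in the text; the variable names are ours.

```python
AU_KM = 150e6                       # astronomical unit, rounded as in the text

roemer_seconds = 11 * 60            # 11 minutes
modern_seconds = 8 * 60 + 20        # 8 minutes 20 seconds

roemer_km_s = AU_KM / roemer_seconds
modern_km_s = AU_KM / modern_seconds

print("Roemer's estimate:", round(roemer_km_s), "km/s")
print("Today's value    :", round(modern_km_s), "km/s")
print("ratio            :", roemer_km_s / modern_km_s)
```

Under this rounding, Roemer's figure comes out at roughly three quarters of today's value.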

Snell-Descartes' "law" is an example of a mathematical model that dictates a "necessary 
world", stating, plain and simple, how things "have to be". In contrast, Fermat's 
"principle" is a theoretical construct that elects a "best world" according to some criterion used 
to compare "possible worlds". 

Fermat's principle is formulated minimizing the integral of $ds = 1\,dt$. In a similar 
way, Leibniz, Euler, Maupertuis, Lagrange, Jacobi, Hamilton, and many others were able 
to reformulate Newtonian mechanics, minimizing the integral of a quantity called action, 
$ds = L\,dt$, where the Lagrangian, $L$, is the difference between the kinetic energy (Leibniz' 
vis viva), $(1/2)mv^2$, and the potential energy of the system (Leibniz' vis mortua). Hence, 
these formulations are called in physics principles of minimum action or principles of least 
action. 
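
To make the link between the integral and the differential formulations concrete, the following sketch works out the simplest case. It assumes a single particle of mass m moving in one dimension under a potential energy V(x); the symbols S, x(t), V and the end times t_0 and t_1 are our notation, introduced only for this illustration. Strictly speaking, the condition used is that the action be stationary, of which a minimum is the most familiar case.

```latex
S[x] = \int_{t_0}^{t_1} L(x, \dot{x})\, dt , \qquad
L(x, \dot{x}) = \tfrac{1}{2} m \dot{x}^2 - V(x) .

% Stationarity of S under variations that keep x(t_0) and x(t_1) fixed
% gives the Euler-Lagrange equation, which here reduces to Newton's law:
\frac{d}{dt} \frac{\partial L}{\partial \dot{x}} - \frac{\partial L}{\partial x} = 0
\quad \Longrightarrow \quad
m \ddot{x} = - \frac{dV}{dx} .
```

The right-hand side is the force acting on the particle, so the "best world" statement about the whole trajectory and the "necessary world" statement about each instant select the same motion.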

4.7 Efficient and Final Causes 

In the 17th century, several models of light and its propagation were developed to explain 
Snell-Descartes' law, see Sabra (1981). The discussion of these models, and the necessary 
versus best world formulations of optics and mechanics discussed in the last section are 
historically connected to the discussion of the metaphysical concepts of efficient and final 
causes. 

This terminology dates back to Aristotle, who distinguishes, in Metaphysics, four 
forms of causation, that is, four types of answers that can be given to a Why-question. 
Namely: 

- Material cause: Because it is made of, or its constituent parts are ... 

- Formal cause: Because it has the form of, or is shaped like ... 

- Efficient cause: Because it is produced, or accomplished by ... 

- Final cause: Because it is intended to, or has the purpose of ... 

Efficient and final causes are the subject of this section. For a general overview of the 
theme in the history of 17th and 18th century Physics, see Brunet (1938), Dugas (1988), 
Pulte (1989), Goldstine (1980), Wiegel (1986) and Yourgrau and Mandelstam (1979). 

Newtonian mechanics is formulated only in terms of efficient causes - an existing 
force acts on a particle (or body) producing a movement described by the Newtonian 
differential equations. Least action principles, on the other hand, are formulated through 
the use of a final cause: the trajectory followed by the particle (or light ray) is that which 
optimizes a certain characteristic, given its original and final positions. This is why these 
formulations are also called teleological, from the Greek τέλος, aim, goal or purpose. A 
general discussion of teleological principles in this context was presented by Leibniz in his 
Specimen Dynamicum of 1695, a translation of which appears in Loemker (1969, p. 442). 

"In fact, as I have shown by the remarkable example of the principles of 
optics, ....(that) final causes may be introduced with great fruitfulness even 
into the special problems of physics, not merely to increase our admiration 
for the most beautiful works of the supreme Author, but also to help us make 
predictions by means of them which could not be as apparent, except perhaps 
hypothetically, through the use of efficient cause... It must be maintained in 
general that all existent facts can be explained in two ways - through a kingdom 
of power or efficient causes and through a kingdom of wisdom or final causes... 
Thus these two kingdoms everywhere permeate each other, yet their laws are 
never confused and never disturbed, so the maximum in the kingdom of power, 
and the best in the kingdom of wisdom, take place together." 

Euler and Maupertuis generalized the arguments of Fermat and Leibniz, deriving 
Newtonian mechanics from the least action principle. The Principle of Least Action was 
stated in Maupertuis (1756, IV, p. 36), as his Lois du Mouvement, Principe General: 

"Laws of Movement, General Principle: 

When a change occurs in Nature, the quantity of action necessary for that 
change is as small as possible. 

The quantity of action is the product of the mass of the bodies times their 
speed and the distance they travel. When a body is transported from one place 
to another, the action is proportional to the mass of the body, to its speed and 
to the distance over which it is transported." 

Maupertuis also used the same theological arguments of Leibniz regarding the harmony 
between efficient and final causes. In Maupertuis (1756, IV, p. 20-23 of Accord de Differents 
Lois de la Nature, qui avoient jusqu'ici paru incompatibles), for example, we find: 

"Accord Between Different Laws of Nature, that seemed incompatible. . . . 

I know the distaste that many mathematicians have for final causes applied 
to physics, a distaste that I share up to some point. I admit, it is risky to 
introduce such elements; their use is dangerous, as shown by the errors made 
by Fermat (and Leibniz(?)) in following them. Nevertheless, it is perhaps not 
the principle that is dangerous, but rather the hastiness in taking as a basic 
principle that which is merely a consequence of a basic principle. 

One cannot doubt that everything is governed by a supreme Being who 
has imposed forces on material objects, forces that show his power, just as he 
has fated those objects to execute actions that demonstrate his wisdom. The 
harmony between these two attributes is so perfect, that undoubtedly all the 
effects of Nature could be derived from each one taken separately. A blind and 
deterministic mechanics follows the plans of a perfectly clear and free Intellect. 
If our spirits were sufficiently vast, we would also see the causes of all physical 
effects, either by studying the properties of material bodies or by studying what 
would be most suitable for them to do. 

The first type of studies is more within our power, but does not take us far. 
The second type may lead us astray, since we do not know enough of the goals 
of Nature and we can be mistaken about the quantity that is truly the expense 
of Nature in producing its effects. 

To unify the certainty of our research with its breadth, it is necessary to 
use both types of study. Let us calculate the motion of bodies, but also consult 
the plans of the Intelligence that makes them move. 

It seems that the ancient philosophers made the first attempts at this sort 
of science, in looking for metaphysical relationships between numbers and ma- 
terial bodies. When they said that God occupies himself with geometry, they 
surely meant that He unites in that science the works of His power with the 
perspectives of His wisdom." 

Some of the metaphysical explanations given by Leibniz and Maupertuis are based on 
theological arguments which can be regarded as late inheritances of medieval philosophy. 
This form of metaphysical argument, however, faded away from the mainstream of science 
after the 18th century. Nevertheless, in the following century, the (many variations of 
the) least action principle disclosed more powerful formalisms and found several new 
applications in physics. For details, see Goldstine (1980) and Wiegel (1986). As stated 
in Yourgrau and Mandelstam (1979, ch.14 of The Significance of Variational Principles 
in Natural Philosophy), 

"Towards the end of the (XIX) century, Helmholtz invoked, on purely sci- 
entific grounds, the principle of least action as a unifying scientific natural 
law, a 'leit-motif' dominating the whole of physics, Helmholtz (1887). 

'From these facts we may even now draw the conclusion that the domain 
of validity of the principle of least action has reached far beyond the boundaries 
of the mechanics of ponderable bodies. Maupertuis' high hopes for the 
absolute general validity of his principle appear to be approaching their fulfill- 
ment, however slender the mechanical proofs and however contradictory the 
metaphysical speculation which the author himself could at the time adduce in 
support of his new principle. Even at this stage, it can be considered as highly 
probable that it is the universal law pertaining to all processes in nature. ... 
In any case, the general validity of the principle of least action seems to me 
assured, since it may claim a higher place as a heuristic and guiding principle 
in our endeavor to formulate the laws governing new classes of phenomena. 
Helmholtz (1887).'" 

4.8 Modern Metaphysics 

In this section we continue the investigation on the use and nature of metaphysical 
principles in theoretical Physics. Like many other adjectives, the word metaphysical has 
acquired both a positive (meliorative, eulogistic, appreciative) and a negative (pejorative, 
derogatory, unappreciative) connotation. 

Logical positivism or logical empiricism was a mainstream school in the philosophy of 
science of the early 20th century. One of the objectives of the positivist school was to 
build science from empirical (observable) concepts only. According to this point of view 
every metaphysical, that is, non-empirical or not directly observable, entity is cognitively 
meaningless and all teleological principles were perceived to fall in this category. 

Teleological arguments were also perceived as problematic in Biology and related fields 
due to the frequent abuse of phony teleological arguments, usually in the form of crude 
fallacies or obvious tautologies, given to provide support to whatever statement in need. 
Maupertuis, the proponent of the first general least action principle, himself, was aware 
of such problems, as clearly stated in the text of his quoted in the previous section. Why 
then did important theoretical physicists insist on keeping teleological arguments and other 
kinds of principles perceived as metaphysical among the regular tools of the trade? 

Yourgrau and Mandelstam (1979, p. 10) emphasize the heuristic importance of meta- 
physical principles in the early development of prominent physical theories: 

"In conformity with the scope of our subject, the speculative facets of the 
thinkers under review have been emphasized. Historically by far more conse- 
quential were the positive contributions to natural science, contributions which 
transferred the emphasis from 'a priori' reasoning to theories based upon ob- 
servation and experiment. Hence, while the future exponents of least principles 
may have been guided in their metaphysical outlook (1) by the idealistic background 
we have described, they had, nevertheless, to present their formulations 
in such fashion that the data of experience would thus be explained. A systematic 
scrutiny of the individual chronological stages in the evolution of minimum 
principles can furnish us with profound insight into continuous transformation 
of a metaphysical canon to an exact natural law. 

(1) By 'metaphysical outlook' we comprehend nothing but those general 
assumptions which are accepted by the scientist." 

The definition of Metaphysics used by Yourgrau is perhaps a bit too vague, or too 
humble. We believe that a deeper understanding of the role played by metaphysics in 
modern theoretical physics can be found (emphases are ours) in Einstein (1950): 

"We have become acquainted with concepts and general relations that enable 
us to comprehend an immense range of experiences and make them accessible 
to mathematical treatment. ... 

(but) Why do we devise theories at all? The answer to the latter question 
is simply: Because we enjoy comprehending, i.e., reducing phenomena by 
the process of logic to something already known or (apparently) evident. ... 

This is the striving toward unification and simplification of the premi- 
ses of the theory as a whole (Mach's principle of economy, interpreted as a 
logical principle). ... 

There exists a passion for comprehension, just as there exists a passion 
for music. That passion is rather common in children, but gets lost in most 
people later on. Without this passion, there would be neither mathematics nor 
natural science. Time and again the passion for understanding has led to the 
illusion that man is able to comprehend the objective world rationally, by pure 
thought, without any empirical foundations-in short, by metaphysics. I believe 
that every true theorist is a kind of tamed metaphysicist, no matter how pure a 
'positivist' he may fancy himself. The metaphysicist believes that the logically 
simple is also the real. The tamed metaphysicist believes that not all that is 
logically simple is embodied in experienced reality, but that the totality of all 
sensory experience can be 'comprehended' on the basis of a conceptual system 
built on premises of great simplicity. The skeptic will say that this is a 'miracle 
creed.' Admittedly so, but it is a miracle creed which has been borne out to an 
amazing extent by the development of science." 

Even more resolute statements are made by Max Planck (emphases are ours) in the 
encyclopedia Die Kultur der Gegenwart (1915, p. 68), and in Planck (1915, p. 71-72): 

"As long as there exists physical science, its highest desirable goal had been 
the solution of the problem to integrate all natural phenomena observed 
and still to be observed into a single simple principle which permits to 
calculate all past and, in particular, all future processes from the present 
ones. It is natural that this goal has not been reached to date, nor ever will it 
be reached entirely. It is well possible, however, to approach it more and more, 
and the history of theoretical physics demonstrates that on this way a rich 
number of important successes could already be gained; which clearly indicates 
that this ideal problem is not merely utopical, but eminently fertile. Among the 
more or less general laws which manifest the achievements of physical science 
in the course of the last centuries, the Principle of Least Action is probably the 
one which, as regards form and content, may claim to come nearest to that 
final ideal goal of theoretical research." 

"Who instead seeks for higher connections within the system of natural 
laws which are most easy to survey, in the interest of the aspired harmony 
will, from the outset, also admit those means, such as reference to the events 
at later instances of time, which are not utterly necessary for the complete 
description of natural processes, but which are easy to handle and can be 
interpreted intuitively." 

From the last quoted statements of Einstein and Planck we can draw the following 
four-point list of motivations for the use of (or for defining the characteristics of) good 
metaphysical principles: 

1- Simplicity; 2- Generality; 3- Interpretability; and 

4- Derivation of powerful and easy to handle (calculate, compute) symbolic (mathe- 
matical) formalisms. 

The first three of these points are very similar to the characteristics of good metaphorical 
arguments, as analyzed in section 3. In this particular context, generality means the 
ability of crossing over different areas or transferring knowledge between multiple fields 
to integrate the understanding of different natural phenomena. Since the least action 
principle clearly conforms with all four criteria in the above list, it is easy to understand 
why it is so dear to physicists, despite the objections to its teleological nature. 

Up to this point we have been arguing that the laws of mechanics in integral form, 
stated in terms of the least action principle, and its associated teleological metaphysical 
concepts, should be accepted alongside the "standard" formulation of mechanics 
in differential form, that is, the differential equations of Newtonian mechanics. However, 
Schlick (1979, V.1, p. 297) proposes a complete inversion of the empirical / metaphysical 
status of the two formulations, see also Muntean (2006) and Stoltzner (2003). According 
to Schlick's view, while the integral or macro-law formulation has its grounds in observable 
quantities, the differential or micro-law formulation is based on non-empirical concepts: 

"That the event at a point depends only on those processes occurring in its 
immediate temporal and spatial neighborhood, is expressed in the fact that space 
and time appear in the formulae of natural laws as infinitely small quantities; 
these formulae, that is, are differential equations. We can also describe them 
in a readily intelligible terminology as micro-laws. Through the mathematical 
process of integration, there emerge from them the macro-laws (or integral 
laws), which now state natural dependencies in their extension over spatial 
and temporal distances. Only the latter fall within experience, for the infinitely 
small is not observable. The differential laws prevailing in nature can therefore 
be conjectured and inferred only from the integral laws, and these inferences 
are never, strictly speaking, univocal, since one can always account for the 
observed macro-laws by various hypotheses about the underlying micro-laws. 
Among the various possibilities we naturally choose that marked by the greatest 
simplicity. It is the final aim of exact science to reduce all events to the fewest 
and simplest possible differential laws." 

From this and other examples presented in sections 6 to 9, we come to the conclu- 
sion that metaphysical concepts are unavoidable, regardless of the formulation in use. 
Positivists, on the other hand, envision the exclusive use of metaphysics-free scientific 
concepts, with grounds on pure empirical experience. In the end, it seems that the latter 
devote themselves to the worthless pursuit of chasing chimeras. Moreover, metaphysical 
arguments are essential to build our intuition. Without intuition, physical reasoning 
would be downgraded to merely cranking the formalism, either by algebraic manipulation 
of the symbolic machinery or by sheer number crunching. Planck (1950, p. 171-172) states 
that: 

"To be sure, it must be agreed that the positivistic outlook possesses a dis- 
tinctive value; for it is instrumental to a conceptual clarification of the signifi- 
cance of physical laws, to a separation of that which is empirically proven from 
that which is not, to an elimination of emotional prejudices nurtured solely by 
customary views, and it thus helps to clear the road for the onward drive of 
research. But Positivism lacks the driving force for serving as a leader on this 
road. True, it is able to eliminate obstacles, but it cannot turn them into 
productive factors. For its activity is essentially critical, its glance is directed 
backward. But progress, advancement requires new associations of ideas and 
new queries, not based on the results of measurements alone, but going beyond 
them, and toward such things the fundamental attitude of Positivism is one of 
aloofness. 

Therefore, up to quite recently, positivists of all hues have also put up the 
strongest resistance to the introduction of atomic hypotheses .... " 

At this point it is opportune to remember Kant's allegory of breathing, which offers a 
counterpoint, in contrast and complement, to his allegory of the dove (Prolegomena to Any 
Future Metaphysics; How Is Metaphysics Possible As a Science?): 

"That the human mind will ever give up metaphysical researches is as little 
to be expected as that we should prefer to give up breathing altogether, to avoid 
inhaling impure air." 

4.9 Averaging over All Possible Worlds 

The last example quoted by Planck provides yet another excellent illustration to illuminate 
not only the issue currently under discussion, but also other topics we want to address. 
In this section we briefly introduce one of the most important models related to the 
debate concerning the atomic hypothesis, namely, Brownian motion. 

We are interested in the Dirichlet problem of describing the steady state temperature 
at a two dimensional plate, given the temperature at its border. The partial differential 
equation that the temperatures, u(x,y), must obey in the Dirichlet problem is known as 
the 2-dimensional Laplace equation, 

$$ \mathrm{div}\,\mathrm{grad}\, u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 , $$

as in Butkov (1968, Ch.8). 

From elementary calculus, see Demidovich and Maron (1976), we have the forward 
and backward finite difference approximations for a partial derivative, 

$$ \frac{\partial u}{\partial x} \approx \frac{u(x+h,y) - u(x,y)}{h} \approx \frac{u(x,y) - u(x-h,y)}{h} . $$

Using these approximations twice, we obtain the symmetric or central finite difference 
approximation for the second derivatives, 

$$ \frac{\partial^2 u}{\partial x^2} \approx \frac{u(x+h,y) - 2u(x,y) + u(x-h,y)}{h^2} , $$

$$ \frac{\partial^2 u}{\partial y^2} \approx \frac{u(x,y+h) - 2u(x,y) + u(x,y-h)}{h^2} . $$

Substitution in the Laplace equation gives the "next neighbors' mean value" equation, 

$$ u(x,y) = \frac{1}{4} \left( u(x+h,y) + u(x-h,y) + u(x,y+h) + u(x,y-h) \right) . $$

From the last equation we can set up a linear system for the temperatures on a rectangular 
grid. The unknown variables, on the left hand side, are the temperatures at the interior 
points of the grid; on the right hand side we have the known temperatures at the boundary 
points. 
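
As an illustration, the linear system described above can be solved by repeatedly 
applying the mean value equation at every interior grid point. The following Python 
sketch is only a minimal example under our own assumptions: a square grid with n + 1 
points per side, unit spacing, and the hypothetical names solve_dirichlet_grid, boundary 
and sweeps, none of which come from the references cited. 

```python
def solve_dirichlet_grid(boundary, n, sweeps=2000):
    """Approximate the steady state temperatures on an (n+1) x (n+1) grid.

    boundary(i, j) returns the known temperature at a boundary node.
    Each sweep replaces every interior value by the mean of its four
    neighbors, that is, it applies the mean value equation in place
    (a Gauss-Seidel iteration for the underlying linear system).
    """
    u = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(n + 1):
            if i in (0, n) or j in (0, n):
                u[i][j] = boundary(i, j)      # known boundary temperatures
    for _ in range(sweeps):
        for i in range(1, n):
            for j in range(1, n):
                u[i][j] = 0.25 * (u[i + 1][j] + u[i - 1][j]
                                  + u[i][j + 1] + u[i][j - 1])
    return u
```

For instance, if the boundary temperature equals the first grid index i, a function that 
already satisfies the mean value equation, the computed interior temperatures should 
agree with i up to rounding errors. 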

From the temperatures at the four neighboring points of a given grid point, an 
estimate of the temperature, u(x,y), at this point is the expected value of the random 
variable Z(x,y) whose value is uniformly sampled from 

{ u(x + h, y), u(x - h, y), u(x, y + h), u(x, y - h) } , 

the temperatures at the east, west, north and south neighbors. Also, if we did not know 
the temperature at the neighboring point sampled, we could estimate the neighbor's 
temperature by sampling one of the neighbor's neighbors. Using this argument recursively, 
we could estimate the temperature u(x,y) through the following Monte Carlo algorithm: 

Consider a "particle" undergoing a symmetric random (or drunken sailor) walk, that 
is, a stochastic trajectory, T = [T(1), . . . T(m)], such that starting at position T(1) = 
[x(1), y(1)], it jumps to positions T(2), T(3), . . . T(m) by uniformly sampling among the 
neighboring points of its current position, until it eventually hits the boundary. More 
precisely, from a given position, T(k) = [x(k), y(k)], at step k, the particle is equally 
likely to jump to any one of its neighboring positions at step k + 1, that is, 

T(k + 1) = [x(k + 1), y(k + 1)] 

is randomly selected from the set 

{ [x(k) + h, y(k)] , [x(k) - h, y(k)] , [x(k),y(k) + h] , [x(k),y(k) - h] } . 

The journey ends when a boundary point, T(m) = [x(m), y(m)], is hit by the particle at 
(random) step m. Defining the random variable Z(T) = u(x(m), y(m)), it can be shown 
that the expected value of Z(T), for T starting at T(1) = [x(1), y(1)], equals u(x(1), y(1)), 
the solution to the Dirichlet problem at [x(1), y(1)]. 
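
The random walk algorithm just described can be sketched in a few lines of Python. 
This is a hedged illustration, not the implementation of any of the cited references: 
we assume a square grid {0, ..., n} x {0, ..., n} with unit spacing, and the names 
walk_to_boundary, monte_carlo_temperature, walks and seed are hypothetical. 

```python
import random

# The four equally likely moves: east, west, north and south.
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

def walk_to_boundary(start, n, rng):
    """Run one symmetric random walk from an interior node of the grid
    {0..n} x {0..n} and return the boundary node T(m) where it ends."""
    x, y = start
    while 0 < x < n and 0 < y < n:
        dx, dy = rng.choice(MOVES)
        x, y = x + dx, y + dy
    return x, y

def monte_carlo_temperature(start, n, boundary, walks=20000, seed=0):
    """Estimate u at `start` as the sample mean of Z(T) = boundary(T(m))."""
    rng = random.Random(seed)          # seeded only for reproducibility
    total = 0.0
    for _ in range(walks):
        total += boundary(*walk_to_boundary(start, n, rng))
    return total / walks
```

With a constant boundary temperature every walk returns the same value, so the 
estimate is exact. With a boundary temperature equal to the coordinate x, the estimate 
at an interior node should be close to the x coordinate of that node, with a sampling 
error that shrinks as the number of walks grows. 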

The above algorithm is only a particular case of more general Monte Carlo algorithms 
for solving linear systems. For details see Demidovich and Maron (1976), Hammersley 
and Handscomb (1964), Halton (1970) and Ripley (1987). Hence, these Monte Carlo 
algorithms allow us to obtain the solution of many continuous problems in terms of an 
expected (average) value of a discrete stochastic flow of particles. More precisely, efficient 
Monte Carlo algorithms are available for solving linear systems, and many of the mathematical 
models in Physics, or science in general, are (or can be approximated by) linear 
equations. Consequently, one should not be surprised to find interpretations of physical 
models in terms of particle flows. 

In 1827, Robert Brown observed the movement of plant spores (pollen) immersed in 
water. He noted that the spores were in perpetual movement, following an erratic or 
chaotic path. Since the motion persisted over long periods of time in different liquid 
media, and powder particles of inorganic minerals also exhibited the same motion pattern, 
he discarded the hypothesis of live or self propelled motion. This "Brownian motion" 
was the object of several subsequent studies, linking the intensity of the motion to the 
temperature of the liquid medium. For further reading, see Brush (1968) and Haw (2002). 

In 1905 Einstein published a paper in which he explains Brownian motion as a fluctu- 
ation phenomenon caused by the collision of individual water molecules with the particle 
in suspension. Using a simplified argument, we can model the particle's motion by a 
random path in a rectangular grid, like the one used to solve the Dirichlet problem. In 
this model, each step is interpreted as a molecule collision with the particle, causing it 
to move, with equal probability, to the north, south, east or west. Stating the formal 
mathematical properties of this stochastic process, known as a random walk, was one of the 
many scientific contributions of Norbert Wiener, one of the forefathers of Cybernetics, 
see Wiener (1989). For good reviews, see Beran (1994) and Embrechts (2002). For an 
elementary introduction, see Berg (1993), Lemons (2002), MacDonald (1962) and Mikosch 
(1998). 

A basic assumption of the random walk model is that distinct collisions or moves 
made by the particle are uncorrelated. Let us consider the one dimensional random walk 
process, where a particle, initially positioned at the origin, $y_0 = 0$, undergoes incremental 
unitary steps, that is, $y_{t+1} = y_t + x_t$, and $x_t = \pm 1$. The steps are assumed unbiased 
and uncorrelated, that is, $E(x_t) = 0$ and $Cov(x_s, x_t) = 0$ for $s \neq t$. Also, $Var(x_t) = 1$. 
From the linearity of the expectation operator, we conclude that $E(y_t) = 0$. Also 

$$ E(y_t^2) = E\left( \sum_{j=0}^{t-1} x_j \right)^2 = \sum_{j=0}^{t-1} E(x_j^2) + \sum_{j \neq k} E(x_j x_k) = t + 0 = t , $$

so that at time $t$ the standard deviation of the particle's position is 

$$ \sqrt{E(y_t^2)} = t^H , \quad \mbox{for} \quad H = \frac{1}{2} . $$

From this simple model an important characteristic, expressed as a sharp statistical 
hypothesis to be experimentally verified, can be derived: Brownian motion is a self-similar 
process, with scaling factor, or Hurst exponent, $H = 1/2$. One possible interpretation of 
the last statement is that, in order to make coherent observations of a Brownian motion, 
if time is rescaled by a factor $\phi$, then space should also be rescaled by a factor $\phi^H$. The 
generalization of this stochastic process for $0 < H < 1$ is known as fractional Brownian 
motion. 
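
The second moment of the one dimensional walk, and the square root scaling of its 
standard deviation, can be checked exactly for small t. The Python sketch below is 
only illustrative and rests on an assumption stronger than the one in the text: the 
steps are taken to be independent (not merely uncorrelated), so that the number of 
upward steps follows a Binomial(t, 1/2) distribution. The function names second_moment 
and std_dev are hypothetical. 

```python
from fractions import Fraction
from math import comb, sqrt

def second_moment(t):
    """Exact E(y_t^2) for t independent, unbiased +/-1 steps with y_0 = 0.

    If k of the t steps are +1, then y_t = 2k - t, and k has probability
    comb(t, k) / 2**t under the Binomial(t, 1/2) distribution.
    """
    return sum(Fraction(comb(t, k), 2 ** t) * (2 * k - t) ** 2
               for k in range(t + 1))

def std_dev(t):
    """Standard deviation of the position at time t (since E(y_t) = 0)."""
    return sqrt(second_moment(t))
```

Under these assumptions second_moment(t) equals t, and rescaling time by a factor 
of 4 rescales the standard deviation by a factor of 2, as expected for a Hurst 
exponent of 1/2. 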

The sharp hypothesis H = 1/2 takes us back to the eternal underlying theme of system 
coupling / decoupling. While regular Brownian motion was built under the essential axiom 
of decoupled (uncorrelated) increments over non-overlapping time intervals, the relaxation 
of this condition, without sacrificing self-similarity, leads to long range correlations. For 
fresh insight, see the original work of Paul Lévy (1925, 1948, 1954, 1970) and Benoit 
Mandelbrot (1983); for textbooks, see Beran (1994) and Embrechts (2002). 

As we have seen in this section, regular Brownian motion can be very useful in modeling 
the low level processes often found in disorganized physical systems. However, in several 
phenomena related to living organisms or systems, long range correlations are exhibited. 
This is the case, for example, in the study of many complex or (self) organized systems, 
such as colloids or liquid crystals, found in soft matter science, in the development of 
embryos or social and urban systems, in electrocardiography, electroencephalography or 
other procedures for monitoring biological signals. Modeling in many of these areas can, 
nevertheless, benefit from the techniques of fractional Brownian motion, as seen in Addison 
(1997), Beran (1994), Bunde and Havlin (1994), Embrechts (2002) and Feder (1988). 
Some of the epistemological consequences of the mathematical and computational models 
introduced in this section are commented on in the following section. 

4.10 Hypothetical versus Factual Models 

The Monte Carlo algorithms introduced in the last section are based on the stochastic 
flow of particles. Yet, these particles can be regarded as mere imaginary entities in a 
computational procedure. On the other hand, some models based on similar ideas, such 
as the kinetic theories of gases, or the random walk model for the Brownian motion, seem 
to give these particles a higher ontological status. It is thus worthwhile to discuss the 
epistemological or ontological status of an entity in a computational procedure, like the 
particles in the above example. 

This discussion is not as trivial, innocent and harmless as it may seem at first sight. 
In 1632 Galileo Galilei published in Florence his Dialogue Concerning the Two Main 
World Systems. At that time it was necessary to have a license to publish a book, the 
imprimatur. Galileo had obtained the imprimatur from the ecclesiastical authorities two 
years earlier, under the explicit condition that some of the theses presented in the book, 
dangerously close to the heliocentric heretical ideas of Nicolas Copernicus, should be 
presented as a "hypothetical model" or as a "calculation expedient" as opposed to the 
"truthful" or "factual" description of "reality" . 

Galileo not only failed to fulfill the imposed condition, but also ridiculed the official 
doctrine. He presented his theories in a dialogue form. In these dialogues, Simplicio, 
the character defending the orthodox geocentric ideas of Aristotle and Ptolemy, was constantly 
mocked by his opponent, Salviati, a zealot of the views of Galileo. In 1633 Galileo 
was prosecuted by the Roman Inquisition, under the accusation of making heretical state- 
ments, as quoted from Santillana (1955, p. 306-310): 

"The proposition that the Sun is the center of the world and does not move 
from its place is absurd and false philosophically and formally heretical, because 
it is expressly contrary to Holy Scripture. The proposition that the Earth is 
not the center of the world and immovable but that it moves, and also with 
a diurnal motion, is equally absurd and false philosophically and theologically 
considered at least erroneous in faith. " 

In the Italian Renaissance, one of the most open and enlightened societies of its time, but 
still within a pre-modern era, where subsystems were only incipient and not clearly differentiated, 
the consequences of mixing scientific and religious arguments could be dire. 
Galileo even uses some arguments that resemble the concept of systemic differentiation, 
for example: 

"Therefore, it would perhaps be wise and useful advice not to add without 
necessity to the articles pertaining to salvation and to the definition of faith, 
against the firmness of which there is no danger that any valid and effective 
doctrine could ever emerge. If this is so, it would really cause confusion to 
add them upon request from persons about whom not only do we not know 
whether they speak with heavenly inspiration, but we clearly see they are defi- 
cient in the intelligence necessary first to understand and then to criticize the 
demonstrations by which the most acute sciences proceed in confirming similar 
conclusions." Finocchiaro (1991, p. 97). 

The paragraph above is from a letter of 1615 from Galileo to Her Serene Highness 
Grand Duchess Cristina but, as usual, Galileo's rhetoric is anything but serene. In 1633 
Galileo is sentenced to prison for an indefinite term. After he abjures his allegedly heretical 
statements, the sentence is commuted to house-arrest at his villa. Legend has it that, after 
his formal abjuration, Galileo muttered the now celebrated phrase, 

Eppur si muove, "But indeed it (the earth) moves (around the sun)". 

Around 1610 Galileo built a telescope (an invention coming from the Netherlands) that 
he used for astronomical observations. Among his findings were four satellites of the planet 
Jupiter, namely, Io, Europa, Ganymede and Callisto. He also observed phases (such 
as the lunar phases) exhibited by the planet Venus. Both facts are either compatible with or 
explained by the Copernican heliocentric theory, but problematic or incompatible with the 
orthodox Ptolemaic geocentric theory. During his trial, Galileo tried to use these observations 
to corroborate his theories, but the judges would not, literally, even 'look' at them. 
The church's chief astronomer, Christopher Clavius, refused to look through Galileo's 
telescope, stating that there was no point in 'seeing' some objects through an instrument 
that had been made just in order to 'create' them. Nevertheless, only a few years after 
the trial, the same Clavius was building fine telescopes, used to make new astronomical 
observations. He took care, of course, not to upset his boss with "theologically incorrect" 
explanations for what he was observing. 

From the late 19th century to 1905 the world witnessed yet another trial, perhaps 
not so famous, but even more dramatic. Namely, that of the atomistic ideas of Ludwig 
Boltzmann. For an excellent biography of Boltzmann, intertwined (as it ought to be) 
with the history of his scientific ideas, see Cercignani (1998). The final verdict on this 
controversy was given by Albert Einstein in his annus mirabilis paper about Brownian 
Motion, together with the subsequent experimental work of Jean Perrin. For details see 
Einstein (1956) and Perrin (1950). A simplified version of these models was presented 
in the previous section, including a "testable" sharp statistical hypothesis, H = 1/2, to 
empirically check the theory. As quoted in Brush (1968), in his Autobiographical Notes, 
Einstein states that: 

"The agreement of these considerations with experience together with Planck's 
determination of the true molecular size from the law of radiation (for high 
temperatures) convinced the skeptics, who were quite numerous at that time 
(Ostwald, Mach) of the reality of atoms. The antipathy of these scholars towards 
atomic theory can indubitably be traced back to their positivistic philosophical 
attitude. This is an interesting example of the fact that even scholars 
of audacious spirit and fine instinct can be obscured in the interpretation of 
facts by philosophical prejudices. The prejudice - which has by no means died 
out in the meantime - consists in the faith that facts themselves can and should 
yield scientific knowledge without free conceptual construction. 

Such misconception is possible only because one does not easily become 
aware of the free choice of such concepts, which through verification and long 
usage, appear to be immediately connected with the empirical material" 

Let us follow Perrin's perception of the "empirical connection" between the concepts 
used in the molecular theory, which contrasted with that of the rival energetic theory, during 
the first decade of the 20th century. In 1903 Perrin was already an advocate of the 
molecular hypothesis, as can be seen in Perrin (1903). According to Brush (1968, p.30-31), 
Perrin rejected the positivist demand for using only directly observable entities. Perrin 
referred to an analogous situation in biology where, 

"the germ theory of disease might have been developed and successfully 
tested before the invention of the microscope; the microbes would have been 
hypothetical entities, yet, as we know now, they could eventually be observed. " 

But only three years later was Perrin (1906) confident enough to reverse the attack, 
accusing the energetic view, rival to the atomic theory, of having "degenerated into 
a pseudo-religious cult". It was the energetic theory, claimed Perrin, that was making 
use of non-observable entities! To begin with, classical thermodynamics had a differential 
formulation, with the functions describing the evolution of a system assumed to be continuous 
and differentiable (notice the similarity between the argument of Perrin and that of 
Schlick, presented in section 8). Perrin based his argument on the contemporary evolution 
of mathematical analysis. Until late in the 19th century, continuous functions were 
naturally assumed to be differentiable. Nevertheless, the development of mathematical 
analysis, at the turn of the 20th century, proved this to be a rather naive assumption. 
Referring to this background material, Perrin argues: 

"But they still thought the only interesting functions were the ones that can 
be differentiated. Now, however, an important school, developing with rigor 
the notion of continuity, has created a new mathematics, within which the 
old theory of functions is only the study (profound, to be sure) of a group of 
singular cases. It is curves with derivatives that are now the exception; or, 
if one prefers the geometrical language, curves with no tangents at any point 
become the rule, while familiar regular curves become some kind of curiosities, 
doubtless interesting, but still very special." 

In three more years, even former opponents were joining the ranks of the atomic theory. 
As W.Nernst (1909, 6th.ed., p.212) puts it: 

"In view of the ocular confirmation of the picture which the kinetic theory 
provides us of the world of molecules, one must admit that this theory begins 
to lose its hypothetical character." 

4.11 Magic, Miracles and Final Remarks 

In several incidents analyzed in the last sections, one can repeatedly find the occurrence of 
theoretical "phase transitions" in the history of science. In these transitions, we observe a 
dominant and strongly supported theory being challenged by an alternative point of view. 
In a first moment, the cheerleaders of the dominant group come up with a variety of 
"disqualifying arguments", to show why the underdog theory, plagued by phony concepts and 
faulty constructions, should not even be considered as a serious contestant. In a second 
moment, the alternative theory is kept alive by a small minority, that is able to foster its 
progress. In a third and final moment, the alternative theory becomes, quite abruptly, the 
dominant view, and many wonder how it is that the old, now abandoned theory could 
ever have had so much support. This process is captured in the following quotation, from the 
preface to the first edition of Schopenhauer (1818): 

"To truth only a brief celebration of victory is allowed between the two long 
periods during which it is condemned as paradoxical, or disparaged as trivial. " 

Perhaps this is the basis for the gloomier statement found in Planck (1950, p. 33-34): 

"A new scientific truth does not triumph by convincing its opponents and 
by making them see the light, but rather because its opponents eventually die, 
and a new generation grows up that is familiar with it. " 

As for the abruptness of the transition between the two phases, representing the two 
theoretical paradigms, this is a phenomenon that has been extensively studied, from 
sociological, systemic and historical perspectives, by Thomas Kuhn (1996, 1977). See 
also Hoyningen-Huene (1993) and Lakatos (1978a,b). For similar ideas presented within 
an approach closer to the orthodox Bayesian theory, see Zupan (1991). 

We finish this section with a quick and simple alternative explanation, possibly just as 
a hint, that I believe can shed some light on the nature of this phenomenon. Elucidations 
of this kind were used many times by von Foerster (2003,b,e) who was, among many other 
things, a skilful magician and illusionist. 

An Ambigram, or ambiguous picture, is a picture that can be looked at in two (or 
more) different ways. Looking at an ambigram, the observer's interpretation or re-solution 
of the image can be attracted to one of two or more distinct eigen-solutions. A memorable 
instance of an ambigram is the Duck-Rabbit, born in 1892, in the humble pages of the 
German tabloid Fliegende Blätter. It was studied in 1899 by the psychologist Joseph 
Jastrow in an article anticipating several aspects of cognitive constructivism, and finally 
made famous by the philosopher Ludwig Wittgenstein in 1953. For a historical account 
of this ambigram, as well as several nice figures, see Kihlstrom (2006). In case anyone 
wonders, Jastrow was Peirce's Ph.D. student and coauthor of the 1885 paper introducing 
randomization, and Wittgenstein is none other than von Foerster's uncle Ludwig. 

According to Jastrow (1899), an ambigram demonstrates how 

"True seeing, observing, is a double process, partly objective or outward - 
the thing seen and the retina - and partly subjective or inward - the picture 
mysteriously transferred to the mind's representative, the brain, and there re- 
ceived and affiliated with other images. " 

Still according to Jastrow, in an ambigram, 

"...a single outward impression changes its character according as it is 
viewed as representing one thing or another. In general we see the same thing 
all the time, and the image on the retina does not change. But as we shift 
the attention from one portion of the view to another, or as we view it with a 
different mental conception of what the figure represents, it assumes a different 
aspect, and to our mental eye becomes quite a different thing." 

Jastrow also describes some characteristics of the mental process of shifting between 
the eigen-solutions of an ambigram, that is, how in "The Mind's Eye" one changes from 
one interpretation to the other. Two of these characteristics are especially interesting in 
our context: 

First, in the beginning, "It may require a little effort to bring about this change, but 
it is very marked when once realized. " 

Second, after both interpretations are known, "Most observers find it difficult to hold 
either interpretation steadily, the fluctuation being frequent, and coming as a surprise. " 

The first characteristic can help us understand either Nernst's "ocular readiness" or, 
in contrast, Clavius' "ocular blindness". After all, the satellites of Jupiter were quite 
tangible objects, ready to be watched through Galileo's telescope, whereas the grains of 
colloidal suspension that could be observed with the lunette of Perrin's apparatus provided 
much more indirect evidence for the existence of molecules. Or maybe not; after all, it 
all depends on what one is capable, ready, or willing to see... 

The second characteristic can help us understand Leibniz' and Maupertuis' willingness 
to accommodate and harmonize two alternative explanations for a single phenomenon, 
that is, to have effective and final causes, or micro and macro versions of physical laws. 

Yet, the existence of sharp, stable, separable and composable eigen-solutions for the 
scientific system in its interaction with its environment, goes far beyond our individual 
or collective desire to have them there. 

These eigen-solutions are the basis upon which technology builds much of the world 
we live in. How well do the eigen-solutions used in these technological gadgets conform 
with von Foerster's criteria? Well, the machine I am using to write this chapter has a 2003 
Intel Pentium CPU carved on a silicon wafer with a "precision" of 0.000,000,1m, and 
is "composed" of about 50 million transistors. This CPU has a clock of 1GHz, so that 
each and every one of the transistors in this composition must operate synchronously to 
a fraction of a thousandth of a thousandth of a thousandth of a second! 

And how well do the eigen-solutions expressed as fundamental physical constants, upon 
which technological projects rely, conform with von Foerster's criteria? Again, some of these 
constants are known up to a precision (relative standard uncertainty) of 0.000,000,001, 
that is, a thousandth of a thousandth of a thousandth! The world wide web site of the 
United States' National Institute of Standards and Technology, at www.physics.nist.gov, 
gives an encyclopaedic view of these constants and their inter-relations. Planck (1950, 
Ch.6) comments on their epistemological significance. 

But far beyond their practical utility or even their scientific interest, these 
eigen-solutions are not magical illusions, but true miracles. Why "true" miracles? 
Because the more they are explained and the better they are understood, the more 
wonderful they become! 

Chapter 5 



Complex Structures, Modularity, 
and Stochastic Evolution 



"Hierarchy, I shall argue, is one of the central struc- 
tural schemes that the architect of complexity uses. " 

"The time required for the evolution of a complex form 
from simple elements depends critically on the number and 
distribution of potential intermediate stable subassemblies. " 

Herbert Simon (1916-2001), 
The Sciences of the Artificial. 

"In order to make some sense here, we must keep an 
open mind about the possibility that for sufficiently 
complex systems, amplitudes become probabilities." 

Richard Feynman (1918-1988), 
Lecture notes on Gravitation. 

5.1 Introduction 

The expression stochastic evolution may seem an oxymoron. After all, evolution indicates 
progress towards complexity and order, while a stochastic (probabilistic, random) process 
seems to be only capable of generating confusion or disorder. The etymology of the word 
stochastic, from στόχος, meaning aim, goal or target, and its current use, meaning chancy 
or noisy, seem to incorporate this apparent contradiction. An alternative use of the same 
root, στοχαστικός, meaning skillful at guessing, conjecturing, or divining the truth, may 
offer a bridge between the two meanings. 

The main goal of this chapter is to study how the concepts of stochastic process and 
evolution of complex systems can be reconciled. Sections 2 and 3 examine two prototypical 
algorithms: Simulated Annealing and Genetic Programming. The ideas behind these two 
algorithms will be used as a basis for most of the arguments used in this chapter. The 
mathematical details of some of these algorithms are presented in appendix H. Section 
4 presents the concept of modularity, and explains its importance in the evolution of 
complex systems. 

While sections 2, 3 and 4 are devoted to the study of general systems, including appli- 
cations to biological organisms and technological devices, section 5 pays closer attention 
to the evolution of complex hypotheses and scientific theories. Section 5 also examines 
the idea of complementarity, developed by the physicist and philosopher Niels Bohr as 
a general framework for the reconciliation of two concepts that appear to be incompatible 
but are, at the same time, indispensable to the understanding of a given system. 
Section 6 explores the connection between complementarity and probability, presenting 
Heisenberg's uncertainty principle. Section 7 extends the discussion to general theories of 
evolution and returns to the pervasive theme of probabilistic causation. Section 8 presents 
our final remarks. 



5.2 The Ergodic Path: One for All 

Most human societies are organized as hierarchical structures. Universities are organized 
in research groups, departments, institutes and schools; Armies in platoons, battalions, 
regiments and brigades; and so on. This has been the way of doing business as described 
in the earliest historical records. Deuteronomy (1:15) describes the ancient hierarchical 
structure of Israel: 

"So I took the heads (ROSh) of your tribes, men wise and known, and 
made them heads over you, leaders (ShR) of thousands, hundreds, fifties and 
tens, and officers (ShTR) for your tribes." 

This verse gives us some idea of the criteria used to appoint leaders (knowledge and 
wisdom), but gives us no hint of the criteria and methods used to form the groups (of 
10, 50, 100 and 1000). Perhaps that was obvious from the family and tribal structure 
already in place. There are many situations, however, where organizing groups to obtain 
an optimal structure is far from trivial. In this section we study such a case: the block 
partition problem. 

5.2.1 Block Partitions 



The matrix block partition problem arises in many practical situations in engineering 
design, operations research and management science. In some applications, the elements 
of a rectangular matrix, A, may represent the interaction between people, corresponding 
to columns, and activities, corresponding to rows, that is, $A_i^j$, the element in row i and 
column j, represents the intensity of the interaction between person j and activity i. The 
block partition problem asks for an optimal ordering or permutation of rows and columns 
taking the permuted matrix to Block Angular Form (BAF), so that each one of b diagonal 
blocks bundles a group of strongly coupled people and activities. Only a small number 
of activities are left outside the diagonal blocks, in a special (b + 1)-th block of residual 
rows. Also, only a small number of people interact with more than one of the b diagonal 
blocks; these correspond to residual columns, see Figure 1. 

Figure 1a,b: Two Matrices in Block Angular Form. 

A matrix in BAF is in Row Block Angular Form (RBAF) if it has only residual rows, 
and is in Column Block Angular Form (CBAF) if it has only residual columns. Each 
angular block can, in turn, exhibit again a BAF, thus creating a recursive or Nested 
Block Angular Form (NBAF). Figure 1a exhibits a matrix in NBAF. In this figure, zero 
elements of the matrix are represented by blank spaces. The number at the position of 
a non-zero element (NZE) is not the corresponding matrix element's value, but rather a 
class tag or "color" indicating the block to which the row belongs. Residual rows receive 
the special color b + 1. The first block has a nested CBAF structure, shown in Figure 1b. 
For the sake of simplicity, this chapter will focus on the BAF partition problem, although 
all our conclusions can be generalized to the NBAF case. 

We motivate the block partition problem further with an application related to numer- 
ical linear algebra. Gaussian elimination is the name of a simple method for solving linear 
systems of order n, by reducing the matrix of the original system to (upper) triangular 
form. This is accomplished by successively subtracting multiples of rows 1 through n 
from the rows below them, so as to eliminate (zero) the elements below each diagonal 
element (or pivot element). The example in Figure 2 illustrates the Gaussian elimination 
algorithm, where the original system, Ax = b, is transformed into an upper triangular 
system, Ux = c. The matrix L stores the multipliers used in the process. Each multiplier 
is stored at the position of the element it was used to eliminate, that is, at the position 
of the zero it was used to create. It is easy to check that A = LU, hence the alternative 
name of the algorithm: LU Factorization. 
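
The elimination process just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: no pivoting is performed (all pivots are assumed to be nonzero), and the 5 by 5 matrix `A` is a hypothetical example in BAF, not the matrix of Figure 2. The names `lu_factor` and `matmul` are ours.

```python
def lu_factor(a):
    """Gaussian elimination without pivoting; returns (L, U) with A = L U."""
    n = len(a)
    u = [row[:] for row in a]                  # working copy, becomes U
    l = [[float(i == j) for j in range(n)] for i in range(n)]  # unit diagonal
    for p in range(n - 1):                     # pivot row p
        for i in range(p + 1, n):              # rows below the pivot
            m = u[i][p] / u[p][p]              # multiplier (assumes nonzero pivot)
            l[i][p] = m                        # stored where the zero is created
            for j in range(p, n):
                u[i][j] -= m * u[p][j]
    return l, u


def matmul(x, y):
    """Plain dense matrix product, used only to check A = L U."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


# Hypothetical 5x5 matrix in BAF: two 2x2 diagonal blocks,
# one residual row (last) and one residual column (last).
A = [[4.0, 1.0, 0.0, 0.0, 1.0],
     [2.0, 3.0, 0.0, 0.0, 2.0],
     [0.0, 0.0, 5.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 4.0, 3.0],
     [1.0, 2.0, 3.0, 1.0, 6.0]]
L, U = lu_factor(A)
```

In this small example the zero blocks of A outside the diagonal blocks, the residual row and the residual column remain zero in L and U, which is the structure preservation property discussed in this section.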

The example in Figure 2 also displays some structural peculiarities. Matrix A is in 
BAF, with two diagonal blocks, one residual row (at the bottom or south side of the 
matrix) and one residual column (at the right or east side of the matrix). This structure 
is preserved in the L and U factors. This structure and its preservation are of paramount
importance in the design of efficient factorization algorithms. Notice that the elimination
process in Figure 2 can be done in parallel. That is, the factorization of each diagonal
block can be done independently of and simultaneously with the factorization of the other
blocks; for more details, see Stern and Vavasis (1994).

Figure 2: A=LU Factorization of CBAF Matrix

A classic combinatorial formulation for the CBAF partition problem, for a rectangular
matrix A, m by n, is the Hypergraph Partition Problem (HPP). In the HPP formulation,
we paint all nonzero elements (NZEs) in a vertex $i \in \{1, \ldots, m\}$ (corresponding to row
$A_i$) with a color $x_i \in \{1, \ldots, b\}$. The color $q^j(x)$ of an edge $j \in \{1, \ldots, n\}$ (corresponding
to column $A^j$) is then the set of all its NZEs' colors. Multicolored edges of the hypergraph
(corresponding to columns of the matrix containing NZEs of several colors) are the
residual columns in the CBAF. The formulation for the general BAF problem also allows
some residual rows to receive the special color b + 1.

The BAF applications typically require:

1. Roughly the same number of rows in each block. 

2. Only a few residual rows or columns. 

From 1 and 2 it is natural to consider the minimization of the objective or cost function

$$ f(x) = \alpha \sum_{k=1}^{b} h_k(x)^2 + \beta\, c(x) + \gamma\, r(x) , \quad h_k(x) = s_k(x) - m/b , $$

$$ q^j(x) = \{ k \in \{1, \ldots, b\} : \exists i ,\ A_i^j \neq 0 \wedge x_i = k \} , \quad s_k(x) = |\{ i \in \{1, \ldots, m\} : x_i = k \}| , $$

$$ c(x) = |\{ j \in \{1, \ldots, n\} : |q^j(x)| \geq 2 \}| , \quad r(x) = |\{ i \in \{1, \ldots, m\} : x_i = b + 1 \}| . $$

The term $c(x)$ is the number of residual columns, and the term $r(x)$ is the number of
residual rows. The constraint functions $h_k(x)$ measure the deviation of each block from
the ideal size m/b. Since we want to enforce these constraints only approximately, we use
quadratic penalty functions, $h_k(x)^2$, that (only) penalize large deviations. If we wanted
to enforce the constraints more strictly, we could use exact penalty functions, like $|h_k(x)|$,
that penalize even small deviations, see Bertsekas and Tsitsiklis (1989) and Luenberger
(1984).
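
As an illustration, the cost of a given coloring can be computed directly from these definitions. The sketch below assumes that the quadratic penalties, the residual column count and the residual row count are combined with three positive weights; the names `baf_cost`, `alpha`, `beta` and `gamma`, and their default values, are our assumptions.

```python
def baf_cost(A, x, b, alpha=1.0, beta=1.0, gamma=1.0):
    """Cost of a row coloring x for the m-by-n matrix A.

    x[i] is in 1..b for block rows, or b + 1 for a residual row.
    The weights alpha, beta, gamma are illustrative defaults.
    """
    m, n = len(A), len(A[0])
    # s[k]: number of rows painted with color k
    s = {k: sum(1 for i in range(m) if x[i] == k) for k in range(1, b + 1)}
    h = [s[k] - m / b for k in range(1, b + 1)]          # block size deviations
    # q[j]: set of block colors present among the NZEs of column j
    q = [{x[i] for i in range(m) if A[i][j] != 0 and x[i] <= b}
         for j in range(n)]
    c = sum(1 for j in range(n) if len(q[j]) >= 2)        # residual columns
    r = sum(1 for i in range(m) if x[i] == b + 1)         # residual rows
    return alpha * sum(hk ** 2 for hk in h) + beta * c + gamma * r
```

For a 5 by 5 pattern with two diagonal blocks, one residual row and one residual column, painting the last row with the special color b + 1 gives a lower cost than forcing it into the second block, since the latter choice turns two extra columns into residual columns.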

5.2.2 Simulated Annealing 

The HPP stated in the last section is very difficult to solve exactly. Technically it is an
NP-hard problem, see Cook (1997). Consequently, we try to develop heuristic procedures
to find approximate or almost optimal solutions. Simulated Annealing (SA) is a powerful
meta-heuristic, well suited to solve many combinatorial problems. The theory behind SA
also has profound epistemological implications, which we explore later on in this chapter.

The first step in defining an SA procedure is to define a neighborhood structure in the
problem's state or configuration space. The neighborhood, $N(x)$, of a given initial state,
x, is the set of states, y, that can be reached from x by a single move. In the HPP, a
single move is defined as changing the color of a single row, $x_i \mapsto y_i$.

In this problem, the neighborhood size is therefore the same for any state x, namely,
the product of the number of rows and colors, that is, $|N(x)| = mb$ for CBAF, and
$|N(x)| = m(b + 1)$ for BAF. This neighborhood structure provides good mobility in the
state space, in the sense that it is easy to find a path (made by a succession of single
moves) from any chosen initial state, x, to any other final state, y. This property is called
irreducibility or strong connectivity. There is also a second technical requirement for
good mobility, namely, this set of paths should be aperiodic. If the length (the number
of single moves) of any path from x to y is a multiple of an integer k > 1, k is called the
period of this set. Further details are given in appendix H.1.

In an SA, it is convenient to have an easy way to update the cost function, computed
at a given state, x, to the cost of a neighboring state, y. The column color weight matrix,
W, is defined so that the element $W_k^j$ counts the number of NZEs in column j (in rows)
of color k, that is,

$$ W_k^j \equiv |\{ A_i^j \mid A_i^j \neq 0 \wedge x_i = k \}| . $$

The weight matrix can be easily updated at any single move and, from W, it is easy to
compute the cost function or a cost differential,

$$ \delta \equiv f(y) - f(x) . $$
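
A minimal sketch of this bookkeeping follows. It assumes colors are numbered from 1 and stores the weight matrix as a list of rows, one per color; the function names are ours. Only the NZEs of the recolored row are visited, so the update costs far less than recomputing W from scratch.

```python
def weight_matrix(A, x, ncolors):
    """W[k][j]: number of NZEs of column j lying in rows of color k + 1."""
    n = len(A[0])
    W = [[0] * n for _ in range(ncolors)]
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            if a != 0:
                W[x[i] - 1][j] += 1
    return W


def recolor_row(A, x, W, i, new_color):
    """Apply the single move x[i] -> new_color, updating W in place."""
    old = x[i]
    for j, a in enumerate(A[i]):
        if a != 0:               # only the NZEs of row i change their tally
            W[old - 1][j] -= 1
            W[new_color - 1][j] += 1
    x[i] = new_color
```

After a move, the number of residual columns can be read off from W by counting the columns with a positive tally in two or more block colors.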

The internal loop of the SA is a Metropolis sampler, where single moves are chosen
at random (uniformly among all possible moves) and then accepted with the Metropolis
probability,

$$ M(\delta, \theta) = \begin{cases} 1 , & \text{if } \delta \leq 0 ; \\ \exp(-\theta \delta) , & \text{if } \delta > 0 . \end{cases} $$

The parameter $\theta$ is known as the inverse temperature, which has a natural interpretation in
statistical physics, see MacDonald (2006), Nash (1974) and Rosenfeld (2005), for intuitive
introductions, and Thompson (1972) for a rigorous text.
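
The acceptance rule translates directly into code. In the sketch below, `delta` is the cost differential of the proposed move and `theta` is the inverse temperature; the function names and the optional `rng` argument are our own choices.

```python
import math
import random


def metropolis_probability(delta, theta):
    """Acceptance probability M(delta, theta) for a cost differential delta."""
    return 1.0 if delta <= 0 else math.exp(-theta * delta)


def accept(delta, theta, rng=random):
    """Draw the accept/reject decision for one proposed move."""
    return rng.random() < metropolis_probability(delta, theta)
```

Downhill and level moves are always accepted, while an uphill move of size delta is accepted with a probability that decays exponentially in the product of theta and delta.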

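The Gibbs distribution, $g(\theta)$, is the invariant distribution for the Metropolis sampling
process, given by

$$ g(\theta)_x = \frac{1}{Z(\theta)} \exp(-\theta f(x)) , \quad Z(\theta) = \sum_x \exp(-\theta f(x)) . $$

The symbol $g(\theta)$ represents a row vector, where the column index, x, spans the possible
states of the system.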

Consider a system prepared (shuffled) in such a way that the probability of starting
the system in initial state x is $g(\theta)_x$. If we move the system to a neighboring state, y,
according to the Metropolis sampling procedure, the invariance property of the Gibbs
distribution assures that the probability that the system will land (after the move) in any
given state, y, is $g(\theta)_y$, that is, the probability distribution of the final (after the move)
state remains unchanged.

Under appropriate regularity conditions, see appendix H.1, the process is also ergodic.
Ergodicity means that even if the system is prepared (shuffled) with an arbitrary probability
distribution, $f(0)$, for the initial state, for example, the uniform distribution, the
probability distribution, $f(t)$, of the final system state after t moves chosen according to
the Metropolis sampling procedure will be sufficiently close to $g(\theta)$ for sufficiently large
t. In other words, the probability distribution of the final system state converges to the
process' invariant distribution. Consequently, we can find out the process' invariant
distribution by following, for a long time, the trajectory of a single system evolving according
to the Metropolis sampling procedure. Hence the expression, The Ergodic Path: One
for All. From the history of an individual system we can recover important information
about the whole process guiding its evolution.

Let us now study how the Metropolis process can help us find the optimal (minimum
cost) configuration for such a system. The behavior of the Gibbs distribution, $g(\theta)$,
changes according to the inverse temperature parameter, $\theta$:

- In the high temperature extreme, $1/\theta \to \infty$, the Gibbs distribution approaches the
uniform distribution.

- In the low temperature extreme, $1/\theta \to 0$, the Gibbs distribution is concentrated in the
states with minimum cost only.
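
These two extremes are easy to check numerically. The sketch below assumes the usual form of the Gibbs distribution, with probabilities proportional to exp(−θ f(x)); the four costs are hypothetical, and the shift by the minimum cost is only a device for numerical stability.

```python
import math


def gibbs(costs, theta):
    """Gibbs distribution over states with the given costs, at inverse temperature theta."""
    fmin = min(costs)                       # shift costs for numerical stability
    w = [math.exp(-theta * (f - fmin)) for f in costs]
    z = sum(w)                              # normalization constant Z(theta)
    return [wi / z for wi in w]


costs = [3.0, 1.0, 2.0, 1.0]               # hypothetical costs of four states
hot = gibbs(costs, 1e-6)                    # 1/theta very large
cold = gibbs(costs, 1e3)                    # 1/theta close to zero
```

Here `hot` is close to the uniform distribution over the four states, while `cold` splits essentially all the probability between the two states of minimum cost.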

Correspondingly the Metropolis process behaves as follows: 

- At the high temperature extreme, the Metropolis process becomes insensitive to the
value of the cost function, wandering (uniformly) at random in the state space.

- At the low temperature extreme, the Metropolis process becomes very sensitive to the 
value of the cost function, accepting only downhill moves, until it reaches a local optimum. 

The central idea of SA involves the use of intermediate temperatures:

- At the beginning use high temperatures, in order to escape the local optima, see Figure 
3a (L), placing the process at the deepest valley, and 

- At the end use low temperatures, in order to converge to the global optimum (the local 
optimum at the deepest valley), see Figure 3a (G). 




Figure 3a: L,G- Local and global minimum; M- Maximum;
S- Short-cut; h,H- Local and global escape energy.
Figure 3b: A difficult problem, with steep cliffs and flat plateaus.



The secret to playing this trick is in the external loop of the SA algorithm, the Cooling
Schedule. The cooling schedule starts with the temperature high enough so that most of the
proposed moves are accepted, and then slowly cools down the process, until it freezes at
an optimum state. The theory of SA is presented in appendix H.1.

The most important result concerning the theory of SA states that, under appropriate
regularity conditions, the process converges to the system's optimal solution as long as
we use the Logarithmic Cooling Schedule. This schedule draws the t-th move according
to the Metropolis process using temperature

$$ 1/\theta(t) = n\Delta / \ln(t) , $$

where $\Delta$ is the maximum objective function differential in a single move and n is the
minimum number of steps needed to connect any two states. Hence, the cooling constant,
$n\Delta$, can be interpreted as an estimate of how high a mountain we may need to climb in
order to reach the optimal position, see Figure 3a(h).

Practical implementations of SA usually cool the temperature geometrically, $\theta \leftarrow (1 + \epsilon)\theta$,
after each batch of Metropolis sampling. The SA is terminated when it freezes,
that is, when the acceptance rate in the Metropolis sampling drops below a pre-established
threshold. Further details on such an implementation are given in the next section.
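
The two loops can be put together in a short sketch: an inner Metropolis sampler and an outer geometric cooling step, with termination by freezing. The toy landscape (ten states on a ring, with hypothetical costs and no two adjacent states of equal cost), the parameter values and the safety cap `max_batches` are all our assumptions, chosen only to keep the example small.

```python
import math
import random


def simulated_annealing(f, neighbors, x0, theta0=0.01, eps=0.05,
                        batch=200, freeze_rate=0.02, max_batches=2000, seed=0):
    """Metropolis sampling with geometric cooling, theta <- (1 + eps) * theta."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    theta = theta0
    for _ in range(max_batches):               # safety cap on the outer loop
        accepted = 0
        for _ in range(batch):                 # inner loop: Metropolis sampler
            y = rng.choice(neighbors(x))       # uniform proposal of a single move
            fy = f(y)
            delta = fy - fx
            if delta <= 0 or rng.random() < math.exp(-theta * delta):
                x, fx = y, fy
                accepted += 1
                if fx < best_f:
                    best_x, best_f = x, fx
        if accepted / batch < freeze_rate:     # the process has frozen
            break
        theta *= 1.0 + eps                     # cooling step
    return best_x, best_f, theta


costs = [5, 3, 4, 2, 6, 1, 7, 0, 8, 4]        # hypothetical 1-D landscape on a ring


def f(i):
    return costs[i]


def neighbors(i):
    return [(i - 1) % len(costs), (i + 1) % len(costs)]


best_x, best_f, theta_end = simulated_annealing(f, neighbors, x0=0)
```

On a landscape with plateaus of equal cost the freezing test in this sketch may never be met, which is why the outer loop is also capped.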

5.2.3 Heuristic Acceleration 

The Standard Simulated Annealing (SSA), described in the last section, behaves poorly
in the BAF problem mainly because it is very difficult to sense the proximity of low cost
states, see Figure 3b, that is,

1. Most of the neighbors of a low cost state, x, may have much higher costs; and

2. The problem is highly degenerate in the sense that there are states, x, with a large
(sub) neighborhood of equal cost states, $S(x) = \{y \in N(x) \mid f(y) = f(x)\}$. In this
case, even rejecting all the proposals that would take us out of $S(x)$ would still give
us a significant acceptance rate.

Difficulty 2, in particular, implies the failure of the SSA termination criterion: A
degenerate local minimum (or meta-stable minimum) could trap the SSA forever,
sustaining an acceptance rate above the established threshold.

The best way we found to overcome these difficulties is to use a heuristic temperature-dependent
cost function, designed to accelerate the SA convergence to the global optimum
and to avoid premature convergence to locally optimal solutions:

$$ f(x, \mu) \equiv f(x) + \frac{1}{\mu}\, u(x) , \quad u(x) \equiv \sum_{j ,\, |q^j(x)| > 1} |q^j(x)| . $$

The state dependent factor in the additional term of the cost function, $u(x)$, can be
interpreted as a heuristic merit or penalty function that rewards multicolored columns
for using fewer colors. This penalty function, and some possible variants, have the effect
of softening the landscape, eroding sharp edges, such as in Figure 3b, into rounded hills
and valleys, such as in Figure 3a. The actual functional form of this penalty function is
inspired by the tally function used in the P3 heuristic of Hellerman and Rarick (1971) for
sparse LU factorization. The temperature dependent parameter, $\mu(\theta)$, gives the inverse
weight of the heuristic penalty function in the cost function $f(x, \mu)$.

Function $f(x, \mu)$ also has the following properties: (1) $f(x, \mu) = f(x)$ when the weight
$1/\mu$ is zero; (2) $f(x, \mu)$ is linear in $1/\mu$. Properties 1 and 2 suggest that we can cool the
weight $1/\mu$ as we cool the temperature, much in the same way we control a parameter of
the barrier functions in some constrained optimization algorithms, see McCormick (1983).

A possible implementation of this Heuristic Simulated Annealing, HSA, is as follows: 




• Initialize the parameters $\theta$ and $\mu$, set a random partition, x, and initialize the auxiliary
variables W, q, c, r, s, and the cost and penalty functions, f and h;

• For each proposed move, $x \to y$, compute the cost differentials

$\delta_0 = f(y) - f(x)$ and $\delta_\mu = f(y, \mu) - f(x, \mu)$ .

• Accept the move with the Metropolis probability, $M(\delta_\mu, \theta)$. If the move is accepted,
update x, the auxiliary variables W, q, c, r, s, and the cost and penalty functions;

• After each batch of Metropolis sampling steps, perform a cooling step update

$\theta \leftarrow (1 + \epsilon_1)\theta$ , $\mu \leftarrow (1 + \epsilon_2)\mu$ , $0 < \epsilon_1 < \epsilon_2 \ll 1$ .

Computational experiments show that the HSA successfully overcomes the difficulties
undergone by the SSA, as shown in Stern (1991). As far as we know, this was the first
time this kind of perturbative heuristic has been considered for SA. Pflug (1996) gives
a detailed analysis for the convergence of such perturbed processes. These results are
briefly reviewed in section H.1.

In the next section we are going to extend the idea of stochastic optimization to 
that of evolution of populations, following insights from biology. In zoology, there are 
many examples of heuristic merit or penalty functions, often called fitness or viability 
indicators, that are used as auxiliary objective functions in mate selection, see Miller 
(2000, 2001) and Zahavi (1975). The most famous example of such an indicator, the 
peacock's tail, was given by Charles Darwin himself, who stated: "The sight of a feather 
in a peacock's tail, whenever I gaze at it, makes me feel sick!" For Darwin, this case was 
an apparent counterexample to natural selection, since the large and beautiful feathers 
have no adaptive value for survival but are, quite on the contrary, a handicap to the 
peacock's camouflage and flying abilities. However, the theory presented in this section
gives us a key to unlock this mystery and understand the tale of the peacock's tail.

5.3 The Way of Sex: All for One 

From the interpretation of the cooling constant given in the last section, it is clear that
we would have a lower constant, resulting in a faster cooling schedule, if we used a richer
set of single moves, especially if the additional moves could provide short-cuts in the
configuration space, such as the moves indicated by the dashed line in Figure 3a. This is one of
the arguments that can be used to motivate another important class of stochastic evolution
algorithms, namely, Genetic Programming, the subject of the following sections. We will
focus on a special class of problems known as functional trees. The general conclusions,
however, remain valid in many other applications.

5.3.1 Functional Trees

In this section, we deal with methods of finding the correct specification of a complex
function. This complex function must be composed recursively from a finite set,
$OP = \{op_1, op_2, \ldots, op_p\}$, of primitive functions or operators, and from a set, $A = \{a_1, a_2, \ldots\}$, of
atoms. The k-th operator, $op_k$, takes a specific number, $r(k)$, of arguments, also known
as the arity of $op_k$. We use three representations for (the value returned by) the operator
$op_k$ computed on the arguments $x_1, x_2, \ldots, x_{r(k)}$:

$op_k(x_1, \ldots, x_{r(k)})$ ,

        op_k
       /    \
    x_1 ... x_{r(k)}

$(op_k\ x_1 \ldots x_{r(k)})$ .

The first is the usual form of representing a function in mathematics; the second is the
tree representation, which displays the operator and its arguments as a tree; and the
third is the prefix, preorder or LISP style representation, which is a compact form of the
tree representation.
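
In a program, a functional tree is conveniently stored as a nested tuple whose first element is the operator, which mirrors the prefix representation. The sketch below is only an illustration: the primitive set `OPS`, the operator name `neg` and the function names are hypothetical choices of ours.

```python
import operator

# Hypothetical primitive set: name -> (arity, implementation)
OPS = {"+": (2, operator.add), "-": (2, operator.sub), "*": (2, operator.mul),
       "neg": (1, operator.neg)}


def eval_tree(tree, env):
    """Evaluate a tree given as nested tuples (op, arg1, ..., argr)."""
    if isinstance(tree, tuple):
        arity, fn = OPS[tree[0]]
        assert len(tree) == arity + 1, "wrong number of arguments"
        return fn(*(eval_tree(arg, env) for arg in tree[1:]))
    return env.get(tree, tree)        # atom: a variable name or a constant


def prefix(tree):
    """Prefix (LISP style) string of a nested-tuple tree."""
    if isinstance(tree, tuple):
        return "(" + " ".join(prefix(t) for t in tree) + ")"
    return str(tree)


t = ("+", ("*", "x", "x"), ("neg", 1))     # x*x + (-1)
```

With this encoding, `prefix(t)` gives the string `(+ (* x x) (neg 1))`, and `eval_tree(t, {"x": 3})` returns 8.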

As a first problem, let us consider the specification of a Boolean function of q variables,
$f(x_1, \ldots, x_q)$, to match a target table, $g(x_1, \ldots, x_q)$, see Angeline (1996) and Banzhaf et al.
(1998). The primitive set of operators and atoms for this problem are:

$$ OP = \{\sim, \wedge, \vee, \rightarrow, \odot, \otimes\} \quad \text{and} \quad A = \{x_1, \ldots, x_q, 0, 1\} . $$

Notice that while the first operator (not) is unary, the last five (and, or, imply, nand, xor)
are binary.

The set, OP, of Boolean operators defined above is clearly redundant. Notice, for
example, that

$$ x_1 \rightarrow x_2 = \sim(x_1 \wedge \sim x_2) , \quad \sim x_1 = x_1 \odot x_1 \quad \text{and} \quad x_1 \wedge x_2 = \sim(x_1 \odot x_2) . $$

This redundancy may, nevertheless, facilitate the search for the best configuration in the
problem's functional space.
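
Redundancies of this kind can be checked exhaustively over the truth table. The sketch below verifies three identities that express imply, not and and through other operators of the set, taking the fifth operator to be nand; the dictionary keys are just labels of ours.

```python
from itertools import product

NOT = lambda p: not p
AND = lambda p, q: p and q
IMPLY = lambda p, q: (not p) or q
NAND = lambda p, q: not (p and q)

identities = {
    "imply via not/and": lambda p, q: IMPLY(p, q) == NOT(AND(p, NOT(q))),
    "not via nand":      lambda p, q: NOT(p) == NAND(p, p),
    "and via nand":      lambda p, q: AND(p, q) == NOT(NAND(p, q)),
}
# Each identity is tested on all four assignments of (p, q).
holds = {name: all(check(p, q) for p, q in product([False, True], repeat=2))
         for name, check in identities.items()}
```

All three entries of `holds` are true, so each of these operators could be dropped from the primitive set without reducing the functions that can be specified.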

Example 1a shows a target table, $g(a, b, c)$. As is usual when the target function
is an experimentally observed variable, the target function is not completely specified.
Unspecified values in the target table are indicated by the don't-care symbol *. The two
solutions, $f_1$ and $f_2$, match the table in all specified cases. Solution $f_1$, however, is simpler
and for that reason may be preferred, see section 4 for further comments.

$f_1 = (\sim A) \vee C$ , $f_2 = (\sim A \wedge \sim B) \vee (A \wedge C)$ .

$f_1 = (\vee\ (\sim A)\ C)$ , $f_2 = (\vee\ (\wedge\ (\sim A)\ (\sim B))\ (\wedge\ A\ C))$ .

Example 1a: Two Boolean functional trees for the target g(a, b, c).

As a second problem, let us consider the specification of a function for an integer
numerical sequence, such as the Fibonacci sequence, presented in Koza (1983),

$$ g^j = \begin{cases} j , & \text{if } j = 0 \vee j = 1 ; \\ g^{j-1} + g^{j-2} , & \text{if } j > 1 . \end{cases} $$

The following array, $g^j$, $0 \leq j \leq 20$, lists the first 21 elements of the Fibonacci sequence.
g = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765] .

In this problem, the primitive set of operators and atoms are:

$$ OP = \{+, -, \times, a\} , \quad A = \{j, 0, 1\} , $$

where j is an integer number, and the first three operators are the usual arithmetic
operators. The specified function is used to compute the first n + 1 elements of the array
$f^j$, seeking to match the target array $g^j$, $0 \leq j \leq n$. The last primitive function is the
recursive operator, $a(i, d)$, that behaves as follows: When computing the j-th element,
$f(j)$, $a(i, d)$ returns the already computed element $f^i$ if i is in the range $0 \leq i < j$, or a
default value, d, if i is out of the range.

In the functional space of this problem, possible specifications for the Fibonacci function,
in prefix representation, are

(+ (a (- j 1) 1) (a (- j (+ 1 1)) 0)) , (+ (a (- j 1) 1) (+ (a (- j (+ 1 1)) 0))) .

Example 2a: Two functional trees for the Fibonacci sequence.

Since the two expressions above are functionally equivalent, the first one may be
preferable for being simpler, see section 4 for further comments.
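
The behavior of the recursive operator is perhaps easiest to see by evaluating such a tree element by element. The sketch below encodes the two expressions as nested tuples; the function names are ours, and we assume that + applied to a single argument returns that argument, as in LISP. Under this reading of a(i, d), the evaluation starts at 1, 1, 2, 3, 5, that is, it reproduces the Fibonacci numbers from the second listed element onward.

```python
def eval_seq_tree(tree, j, computed):
    """Evaluate a nested-tuple tree for index j, given f[0..j-1] in `computed`."""
    if not isinstance(tree, tuple):
        return j if tree == "j" else tree           # atom: j or a constant
    op, args = tree[0], [eval_seq_tree(t, j, computed) for t in tree[1:]]
    if op == "+":
        return sum(args)                            # also accepts one argument
    if op == "-":
        return args[0] - args[1]
    if op == "*":
        return args[0] * args[1]
    if op == "a":                                   # recursive operator a(i, d)
        i, d = args
        return computed[i] if 0 <= i < j else d
    raise ValueError("unknown operator: %r" % (op,))


def sequence(tree, n):
    """First n + 1 elements f[0..n] of the sequence specified by the tree."""
    computed = []
    for j in range(n + 1):
        computed.append(eval_seq_tree(tree, j, computed))
    return computed


t1 = ("+", ("a", ("-", "j", 1), 1), ("a", ("-", "j", ("+", 1, 1)), 0))
t2 = ("+", ("a", ("-", "j", 1), 1), ("+", ("a", ("-", "j", ("+", 1, 1)), 0)))
```

Both trees produce the same sequence, which illustrates in what sense they are functionally equivalent while differing in size.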

As a third problem, we mention Polynomial Network models. These functional trees
use as primitive operators linear, quadratic or cubic polynomials in one, two or three
variables. For several examples and algorithmic details, see Farlow (1984), Madala and
Ivakhnenko (1994) and Nikolaev and Iba (2006). Figure 4 shows a simple network used
for sales forecast; a detailed report is given in Lauretto et al. (1995). One variable is a
magazine's sales forecast obtained by a VARMA time series model using historic sales,
econometric and calendric data. Variables $x_1$ to $x_4$ are qualitative variables (in the scale:
Bad, Weak, Average, Good, Excellent) to assess the appeal or attractiveness of an individual
issue of the magazine, namely: (1) cover impact; (2) editorial content; (3) promotional
items; and (4) point of sale marketing.




Figure 4: Polynomial Network. 
Rings on a node: 1- Linear; 2- (incomplete) Quadratic; 3- (incomplete) Cubic. 

Of course, the optimization of a Polynomial Network is far more complex than the
optimization of Boolean or algebraic networks, since not only topology has to be optimized
(identification problem), but also, given a topology, the parameters of the polynomial
function have to be optimized (estimation problem). Parameter optimization of sub-trees
can be based on Tikhonov regularization, ridge regression, steepest descent or Partan
gradient rules. For several examples and algorithmic details, see Farlow (1984), Madala
and Ivakhnenko (1994), Nikolaev and Iba (2001, 2003, 2006), and Stern (2008).

5.3.2 Genetic Programming

Starting from a given random tree, one can start an SA type search in the problem's 
(topological) space. In GP terminology, the individual's functional specification is called 
its genotype, the individual's expressed behavior, or computed solutions, is called its 
phenotype. Changing a genotype to a neighboring one is called a mutation. The quality 
of a phenotype, its performance, merit or adaptation, is measured by a fitness function. 

While SA looks at the evolution of a single individual, GP looks at the evolution of 
a population. A time parameter, t, indexes the successive generations of the evolving 
population. In GP, individuals typically have short lives, surviving only a few generations 
before dying. Meanwhile, populations may evolve for a very long time. 

In GP an individual may, during its ephemeral life, share information, that is, swap 
(copies) of its (partial) genome, with other individuals. This genomic sharing process is 
called sex. In GP an individual, called a parent, may also participate in the creation of 
a new individual, called its child, in a process called reproduction. In the reproduction 
process, an individual gives (partial) copies of its genotype to its offspring. Reproduction 
involving only one parent is called asexual; otherwise it is called sexual reproduction.

In the following list, a set of possible mutation and sex operators is given:

1- Point leaf mutation: Replace a leaf atom by another atom.

2- Point operator mutation: Replace a node operator by a compatible operator. 

3- Shrink mutation: Replace a sub-tree by a leaf with a single atom. 

4- Grow mutation: Replace the atom at a leaf by a random tree. 

5- Permutation: Change the order of the children of a given node. 

6- Gene duplication: Replace a leaf by a copy of a sub-tree. 

7- Gene inversion: Switch two sub-trees. 

8- Crossover: Share or exchange sub-trees between individuals. 

The first five operators, involving only one sub-tree, are sometimes called (proper) 
mutations, while the last three operators, involving two or more separate sub-trees, are 
called recombinations. Also notice that the first seven operators involve only one indi- 
vidual, while crossover involves two or more. This list of mutation and recombination 
operators is redundant but, again, this redundancy may also facilitate the search for the 
best configuration in the problem's functional space. 

We should mention that the terms used to name these operators are not standard in
the field of GP, and even less so in biology, genetics, zoology and botany. We should also
mention that the forms of GP presented in this section do not explore the possibility
of allowing individuals to carry a (redundant) set of two or more homologous (similar
but not identical) specifications (genes), a phenomenon known as diploidy or multiploidy.
Diploidy is common in eukaryotic (biological) life, and can provide a much richer structure
and better performance to GP.

Sexual reproduction can be performed by crossover, with parents giving (partial) copies
of their genome to the children. The following examples show a pair of parents and children
generated by a single crossover, for some of the problems considered in the last section.
A square bracket in the prefix representation indicates a crossover point. The tree
representation would indicate the same crossover points by broken edges (=). Notice that
in these examples there is a child corresponding to a solution presented in the last section.

[Tree diagrams of the Boolean functional trees $f_1$, $f_2$ and $f_3$, with crossover points marked by broken edges (=).]

Example 1b: Crossover between Boolean functional trees.



Parents: (* [a (- j 1) 1] (* j j)) , (+ (a (- j (+ 1 1)) 0) [- j 1]) ;

Children: (* [- j 1] (* j j)) , (+ (a (- j (+ 1 1)) 0) [a (- j 1) 1]) .

Example 2b: Crossover between arithmetic functional trees. 
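
A single crossover of this kind amounts to exchanging two sub-trees. In the sketch below, trees are nested tuples with the operator at index 0, and a crossover point is given by a path of child indices from the root; this path encoding and the function names are our assumptions.

```python
def get_subtree(tree, path):
    """Subtree reached by following child indices in `path` from the root."""
    for k in path:
        tree = tree[k]
    return tree


def replace_subtree(tree, path, new):
    """Copy of `tree` with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    k = path[0]
    return tree[:k] + (replace_subtree(tree[k], path[1:], new),) + tree[k + 1:]


def crossover(p1, path1, p2, path2):
    """Exchange the subtrees at the two crossover points; returns two children."""
    s1, s2 = get_subtree(p1, path1), get_subtree(p2, path2)
    return replace_subtree(p1, path1, s2), replace_subtree(p2, path2, s1)


# Parents of Example 2b as nested tuples; index 0 of a tuple is the operator.
p1 = ("*", ("a", ("-", "j", 1), 1), ("*", "j", "j"))
p2 = ("+", ("a", ("-", "j", ("+", 1, 1)), 0), ("-", "j", 1))
c1, c2 = crossover(p1, (1,), p2, (2,))
```

The second child, `c2`, is the sum of the two recursive calls, that is, a specification of the Fibonacci function with its two terms in swapped order.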

Finally, the reproduction and survival selection processes in GP assume that individu- 
als are chosen from the general population according to sampling probabilities called the 
mating (or representation) distribution and the survival distribution, respectively. Some 
general policies used to specify these probability distributions, based on the individual's 
fitness, are given below: 

1- Top Rank Selection: The highest ranking (best fit) individual is selected. 

2- High Pressure Selection: An individual is selected from the population with a 
probability that increases sharply (super-linearly) with its fitness or fitness' rank. 

3- Fitness Proportional Selection: An individual is selected from the population with 
a probability that is proportional to its fitness. 

4- Rank Proportional Selection: An individual is selected from the population with a 
probability that is proportional to its fitness' rank. 

5- Low Pressure Selection: An individual is selected from the population with a
probability that increases modestly (sub-linearly) with its fitness or fitness' rank.

6- Tournament Selection: A small subset is sampled at random (uniformly) from the
population, from which the best (one or two) individuals are selected.

7- Uniform Selection: An individual is selected from the population with uniform 
probability. 
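
Three of the policies in this list can be sketched with the standard `random` module. The sketch assumes non-negative fitness values, higher fitness being better, and ranks counted from 1 for the worst individual; the function names and the tournament size are our choices.

```python
import random


def fitness_proportional(pop, fitness, rng):
    """Select one individual with probability proportional to its fitness."""
    return rng.choices(pop, weights=[fitness(p) for p in pop], k=1)[0]


def rank_proportional(pop, fitness, rng):
    """Select with probability proportional to the fitness rank (worst = 1)."""
    ranked = sorted(pop, key=fitness)
    weights = list(range(1, len(ranked) + 1))
    return rng.choices(ranked, weights=weights, k=1)[0]


def tournament(pop, fitness, rng, size=3):
    """Sample `size` individuals uniformly and return the best of them."""
    return max(rng.sample(pop, size), key=fitness)
```

Each function takes a `random.Random` instance, so that a run can be reproduced by fixing the seed. Rank based policies are insensitive to the scale of the fitness function, while fitness proportional selection is not.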

These processes are supposed to mimic biological selection mechanisms, including
sexual differentiation, like male and female (alleged) behavior, familial and other
sub-population structures, etc.

5.3.3 Schemata and Parallelism 

A possible motivation for developing populational evolutionary algorithms like GP, instead 
of single individual evolutionary algorithms, like straight SA, is to consider a richer and 
better neighborhood structure. The additional moves made available should provide short- 
cuts in the problem's configuration space, lowering the cooling constant and allowing a 
faster convergence of the algorithm. 

The intrinsic parallelism argument, first presented in Holland (1975), proves that, un- 
der appropriate conditions, GP is likely to succeed in providing such a rich neighborhood 
structure. The mathematical analysis of this argument is presented in section H.2, see 
also Reeves (1993, Ch.4 Genetic Algorithms). According to Reeves,

"The underlying concept Holland used to develop a theoretical analysis of
his GA [GP] was that of schema. The word comes from the past tense of the
Greek verb ἔχω, echo, to have, whence it came to mean shape or form; its
plural is schemata." (p. 154)

Schemata are partially specified patterns in a program, like partially specified segments
of prefix expressions, or partial code for functional sub-trees. The length and order of a
schema are the distance between the first and last defined position on the schema, and
the number of defined positions, respectively, see section H.2. The Intrinsic Parallelism
theorem states that the number of schemata (of order l and length 2l, in binary coded
programs, in individuals of size n) present in a population of size m is proportional to $m^3$.
The crossover operator enriches the neighborhood of an individual with the schemata
present in other individuals of the population. If, as suggested by the intrinsic parallelism
theorem, the number of such schemata is large, GP is likely to be an effective strategy.
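
For binary coded programs, the length and order of a schema are simple to compute. The sketch below writes a schema as a string over 0, 1 and a wildcard character `*` for the unspecified positions; this notation and the function names are our assumptions.

```python
def schema_order(schema):
    """Number of defined (non-wildcard) positions of a schema string."""
    return sum(1 for ch in schema if ch != "*")


def schema_length(schema):
    """Distance between the first and last defined positions."""
    defined = [k for k, ch in enumerate(schema) if ch != "*"]
    return defined[-1] - defined[0] if defined else 0


def matches(schema, individual):
    """True if the binary string `individual` is an instance of the schema."""
    return len(schema) == len(individual) and all(
        s == "*" or s == c for s, c in zip(schema, individual))
```

For example, the schema `*1*0**` has order 2 and length 2, and the individual `011001` is one of its instances.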

Schaffer (1987, p. 89) celebrates this theorem, stating that:

"this [intrinsic parallelism] constitutes the only known example of
combinatorial explosion working to advantage instead of disadvantage."

Indeed, Schaffer has ample reason to praise Holland's result. Nevertheless, we must
analyze this important theorem carefully, in order to understand its consequences
correctly. In particular, we should pay close attention to the unit, u, used to measure the
population size, m. As shown in detail in section H.2, this unit, $u = 2^l$, is itself exponential
in the schemata order. Therefore, the combinatorial explosion works to our advantage
as long as we use short schemata, relative to the log-size of the population. This situation
is described by Reeves as:

"Thus the ideal situation for a GA [GP] are those where short, low-order 
schemata combine with each other to form better and better solutions. The 
assumption that this will work is called by Goldberg (1989) the building-block 
hypothesis. Empirical evidence is strong that this is a reasonable assumption 
in many problems." (p. 158) 

One key question we must face in order to design a successful GP application is, 
therefore: How then can we organize our working space so that our programming effort 
can rely on short schemata? 

The solution to this question is well known to computer scientists and software engi- 
neers: Organize the programs hierarchically (recursively) as self-contained (encapsulated) 
building-blocks (modules, functions, objects, sub-routines, etc.). The next section is ded- 
icated to the study of modular organization, and its spontaneous emergence in complex 
systems. 

5.4 Simple Life: Small is Beautiful 

The biological world is an endless source of inspiration for improvements and variations in 
GP (of course, one should also be careful not to be carried away by superficial analogies). 
A nice anthology of introductory articles can be found in the book by Michod and Levin 
(1988), The Evolution of Sex: An Examination of Current Ideas. Let us begin this section 
with an interesting biological example. 

It is a well known phenomenon that bacteria can develop antibiotic resistance. Among 
the most common mechanisms conferring resistance to new antibiotics, one can list: 
Agents that modify or destroy the antibiotic molecular structure; Agents that modify 
or protect the antibiotic targets; New pathways offering alternatives to those blocked 
by the antibiotic action; etc. However, all these mechanisms entail a fitness cost to the 
modified individuals. At the very least, there is the cost of complexity, that is, the cost 
of building and maintaining these new mechanisms. Hence, if the selective pressure of 
the antibiotic presence is interrupted, resistant bacterial populations will often revert to 
non-resistant, see for example Bjorkholm et al. (2001). 

This biological example can be interpreted as the embodiment of Ockham's razor or
lex parsimoniae, an epistemological principle stated by the 14th-century English logician
friar William of Ockham, in the following forms:

- Entia non sunt multiplicanda praeter necessitatem, or

- Pluralitas non est ponenda sine necessitate.

that is, entities should not be created or multiplied without necessity.

In section 4.1 we will see how well this principle applies to statistical models, and how
it can be enforced. In section 4.2 we will examine introns, a phenomenon that at first
glance appears to contradict Ockham's razor. Nevertheless, we will also see how introns
allow building blocks to appear spontaneously as an emergent feature in GP.



5.4.1 Overfitting and Regularization 

This section discusses the use of Ockham's razor in statistical modeling. As an illustrative
example, we use a standard normal multiple linear regression model. This model states
that $y = X\beta + u$, $X$ $n \times k$, where n is the number of observations, k is the number of
independent variables, $\beta \in ]-\infty, \infty[^k$ is the vector of regression coefficients, and u is a
Gaussian white noise such that $E(u) = 0$ and $\mathrm{Cov}(u) = \sigma^2 I$, $\sigma \in [0, \infty[$, see DeGroot
(1970), Hocking (1985) and Zellner (1971). Using the standard diffuse prior $p(\beta, \sigma) = 1/\sigma$,
the joint posterior probability density, $f(\beta, \sigma \mid y, X)$, and the MAP (maximum a
posteriori) estimators for the parameters are given by:

$$ f(\beta, \sigma \mid y, X) = \frac{1}{\sigma^{n+1}} \exp\left( -\frac{1}{2\sigma^2} \left( (n-k)s^2 + (\beta - \hat\beta)' X'X (\beta - \hat\beta) \right) \right) , $$

$$ \hat\beta = (X'X)^{-1} X'y , \quad \hat{y} = X\hat\beta , $$

$$ s^2 = (y - \hat{y})'(y - \hat{y}) / (n-k) . $$



In the polynomial multiple linear regression model of order $k$, the dependent variable 
$y$ is explained by the powers 0 through $k$ of the independent variable $x$, i.e., the regression 
matrix element at row $i$ and column $j$ is $X_i^j = (x_i)^{j-1}$, $i = 1 \ldots n$, $j = 1 \ldots k+1$. Note 
that the model of order $k$ has dimension $d = k + 2$, with parameters $\beta_0, \beta_1, \ldots, \beta_k$, and $\sigma$. 

In the classical example presented in Sakamoto et al. (1986, ch.8), we want to fit a 
linear regression polynomial model of order $k$, 

$$ y = \beta_0 1 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_k x^k + N(0, \sigma I) $$ 

through the $n = 21$ points, $(x_i, y_i)$, in Table 1. 

Table 5.1: Sakamoto's data set for polynomial model 

This example was produced by Sakamoto simulating the i.i.d. stochastic process 

$$ y_i = g(x_i) + 0.1 \, N(0, 1) , \quad g(x) = \exp((x - 0.3)^2) - 1 , $$ 

where the target function, $g(x)$, cannot be expressed exactly as a finite order linear 
regression polynomial model. 

Figure 5 presents the target function in the example's range, the data set (Sakamoto's 
set in 5a and a second set generated by the same stochastic process in 5b), and the 
regression polynomials of orders 0 through 5. In this example, all the available data points 
are used to fit the model. An alternative procedure would be to divide the available data 
into two sets, the training set, used to adjust the model, and the test set, used to test the 
model's predictive or extrapolation power. 

Just by visual inspection, one can come to the following conclusions: 

- If the model is too simple, it fails to capture important information available in the 
data, making poor predictions. 

- If the model is too complex, it overfits the training data, that is, the curve f(t) tends 
to become an interpolation curve, but the curve becomes unstable and predicted values 
become meaningless. 
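
The two conclusions above can be explored numerically. The following is a minimal sketch, not Sakamoto's original computation: since the data of Table 1 is not reproduced here, it simulates a fresh sample from the stochastic process stated in the text. The equally spaced grid on [0, 1], the random seed, and all function names are assumptions made for illustration only.

```python
import numpy as np

def g(x):
    """Target function from the text: g(x) = exp((x - 0.3)^2) - 1."""
    return np.exp((x - 0.3) ** 2) - 1.0

def simulate(n=21, noise=0.1, seed=0):
    """Draw y_i = g(x_i) + noise * N(0, 1); the grid on [0, 1] is an assumption."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)
    return x, g(x) + noise * rng.standard_normal(n)

def design(x, k):
    """Regression matrix of order k: column j holds x**j, j = 0..k."""
    return np.vander(x, k + 1, increasing=True)

def fit(x, y, k):
    """Least-squares coefficients beta_0..beta_k."""
    beta, *_ = np.linalg.lstsq(design(x, k), y, rcond=None)
    return beta

def mean_sq_error(x, y, beta):
    """Mean squared difference between y and the fitted polynomial at x."""
    r = y - design(x, len(beta) - 1) @ beta
    return float(r @ r) / len(y)

x, y = simulate()
x_new = np.linspace(0.0, 1.0, 201)  # fresh grid used to probe the fitted curve
train_err, target_err = [], []
for k in range(6):
    beta = fit(x, y, k)
    train_err.append(mean_sq_error(x, y, beta))              # error on the data
    target_err.append(mean_sq_error(x_new, g(x_new), beta))  # error against g
```

Because the models are nested, `train_err` can only decrease as the order grows, while `target_err` measures how far each fitted curve is from the target function and need not decrease.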

The polynomial regression model family presented in the example is typical, in the 
sense that it offers a class of models of increasing dimension, or complexity. This poses 
a model selection problem, that is, deciding, among all models in the family, the "best" 
adapted to the data. It is natural to look for a model that accomplishes a small empirical 
error, the estimated model error in the training data, $R_{emp}$. A regression model is 
estimated by minimizing the 2-norm empirical error. However, we cannot select the "best" 
model based only on the empirical error, because we would usually select a model of very 
high complexity. In general, when the dimensionality of the model is high enough, the 
empirical error can be made equal to zero by simple interpolation. It is a well known fact 
in statistics (or learning theory), that the prediction (or generalization) power of such 
high dimension models is poor. Therefore the selection criterion also has to penalize the 
model dimension. This is known as a regularization mechanism. 

Figure 5a,b: Target function, data points, and polynomial regressions of order 0 to 5; 
o: Data points; o: Target function; *: Best (quadratic) polynomial regression. 

Some model selection criteria define $R_{pen} = r(d, n) R_{emp}$ as a penalized (or regularized) 
error, using a regularization factor, $r(d, n)$, where $d$ is the model dimension and $n$ the 
number of training data points. Common regularization factors, using $p = d/n$, are: 

• Akaike's final prediction error: $FPE = (1 + p)/(1 - p)$, 

• Schwarz' Bayesian criterion: $SBC = 1 + \ln(n) p/(2 - 2p)$, 

• Generalized cross validation: $GCV = (1 - p)^{-2}$, 

• Shibata model selector: $SMS = 1 + 2p$. 

All these regularization factors are supported by theoretical arguments as well as by 
empirical performance; other common regularization methods are the Akaike information 
criterion (AIC), and the Vapnik-Chervonenkis (VC) prediction error. For more details, see 
Akaike (1970 and 1974), Barron (1984), Breiman (1984), Cherkassky (1998), Craven (1979), 
Michie (1994), Mueller (1994), Shibata (1981), Schwarz (1978), Unger (1981) and Vapnik 
(1995, 1998). 
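
The regularization factors listed above are simple functions of $p = d/n$. The sketch below transcribes them directly, so that one can see how quickly each penalty grows with the model dimension; the function names and the helper `penalized_error` are hypothetical, introduced only for illustration.

```python
import math

def fpe(d, n):
    """Akaike's final prediction error factor: (1 + p) / (1 - p)."""
    p = d / n
    return (1 + p) / (1 - p)

def sbc(d, n):
    """Schwarz' Bayesian criterion factor: 1 + ln(n) p / (2 - 2p)."""
    p = d / n
    return 1 + math.log(n) * p / (2 - 2 * p)

def gcv(d, n):
    """Generalized cross validation factor: (1 - p)^(-2)."""
    p = d / n
    return (1 - p) ** -2

def sms(d, n):
    """Shibata model selector factor: 1 + 2p."""
    p = d / n
    return 1 + 2 * p

def penalized_error(r_emp, d, n, factor):
    """R_pen = r(d, n) * R_emp for a chosen regularization factor."""
    return factor(d, n) * r_emp

# Factors for polynomial orders k = 0..5 with n = 21 points and d = k + 2.
n = 21
table = {k: (fpe(k + 2, n), sbc(k + 2, n), gcv(k + 2, n), sms(k + 2, n))
         for k in range(6)}
```

All four factors equal 1 when $d = 0$ and increase with $d$, so a more complex model must reduce the empirical error by more than its penalty in order to be preferred.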

We can also use the FBST as a model selection criterion by testing the hypothesis of 
some of its parameters being null, as detailed in Pereira and Stern (2001). The FBST 
version of Ockham's razor states: 

- Do not include in the model a new parameter unless there is strong evidence that it is 
not null. 

Table 2 presents the empirical error, $EMP = \|y - \hat{y}\|_2^2 / n$, for models of order $k$ 
ranging from 0 to 5, several regularization criteria previously mentioned as well as the 
Akaike information criterion (AIC), as computed by Sakamoto. Table 2 also presents the 
e-value supporting the hypothesis $H: \beta_k = 0$, that is, the hypothesis stating that the 
model is in fact of order $k - 1$. 

Table 5.2: Selection Criteria for the Polynomial Model 



Alternative approaches to regularization are given by Jorma Rissanen's MDL (minimum 
description length) and Chris Wallace's MML (minimum message length). Following 
an old idea of Andrey Kolmogorov, these criteria make direct use of a program's code 
length as a measure of complexity, see Rissanen (1978, 1989), Wallace and Boulton (1968) 
and Wallace and Dowe (1999). 

5.4.2 Building Blocks and Modularity 

As seen in section 3, GP can produce polynomial networks that are very similar to the 
polynomial regression models presented in the last section. The main difference between 
the polynomial networks and the regression models lies in their generation process: While 
the regression models are computed by a deterministic algorithm, the GP networks are 
generated by a random evolutionary search. However, if one uses compatible measures 
of performance for the GP fitness function and the regression (penalized or regularized) 
error, one could expect GP to produce networks that somehow fulfill Ockham's parsimony 
principle. 

Surprisingly, this is not so. P. Angeline (1994, 1996) noted that GP generated networks 
typically contain large segments of extraneous code, that is, code segments that, if 
removed, do not (significantly) alter the solution computed by the network. Trivial 
examples of extraneous code segments are (+ s 0) and (* s 1), where s is a sub-expression. 
By their very definition, extraneous code segments cannot (significantly) contribute to 
an individual's fitness, and hence to its survival or mating probabilities. However, 
Angeline noticed that the presence of extraneous code could significantly contribute to the 
expected fitness of the individual's descendants! Apparently, the role of these (sometimes 
very large) patches of inert code is to isolate important blocks of working code, and to 
protect these blocks from being broken at recombination (destructive crossover). 
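
To make the notion of extraneous code concrete, the sketch below represents a GP expression as nested tuples `(op, left, right)` and strips the two trivial patterns mentioned above, (+ s 0) and (* s 1). The tuple encoding and the function names are assumptions made for illustration; actual GP systems use their own tree representations.

```python
def simplify(expr):
    """Recursively remove the trivial extraneous patterns (+ s 0) and (* s 1).

    Leaves are variable names (strings) or numeric constants; internal
    nodes are tuples (op, left, right) with a binary operator op.
    """
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    if op == "+":
        if right == 0:
            return left
        if left == 0:
            return right
    if op == "*":
        if right == 1:
            return left
        if left == 1:
            return right
    return (op, left, right)

def size(expr):
    """Number of nodes (operators and leaves) in the expression tree."""
    if not isinstance(expr, tuple):
        return 1
    return 1 + sum(size(arg) for arg in expr[1:])

# (* (+ (* x x) 0) 1) computes the same function as (* x x).
bloated = ("*", ("+", ("*", "x", "x"), 0), 1)
compact = simplify(bloated)
```

Here `bloated` has 7 nodes and `compact` has 3, although both compute the same function. The removed nodes add nothing to fitness, but they do add crossover sites that lie outside the working block.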

In biological organisms, the genetic code of eukaryotes exhibits similar regions of code 
(DNA) that are or are not expressed in protein synthesis; these regions are called exons 
and introns, respectively. Introns do not directly code amino-acid sequences in proteins, 
nevertheless, they seem to have an important role in the meta-control of the genetic 
material expression and reproduction. 

Subsequent work of several authors tried to incorporate meta-control parameters to 
GP. Iba and Sato (1993, p. 548), for example, propose a meta-level strategy for GP based 
on a self-referential representation, where 

"[a] self-referential representation maintains a meta-description, or 
meta-prescription, for crossover. This meta-genetic descriptions are allowed to 
co-evolve with the gene pool. Hence, genetic and meta-genetic code variations are 
jointly selected. How well the genetic code is adapted to the environment is 
translated by the merit or objective function which, in turn, is used for the 
immediate, short-term or individual selection process. How well the genetic and 
meta-genetic code are adapted to each other impacts on the system's evolvability, 
a characteristic of paramount importance in long-run survival of the 
species." 

Functional trees, for example, can incorporate edge annotations, like probability weights, 
linkage compatibility or affinity, etc. Such annotations are meta-parameters used to 
control the recombination of the sub-tree directly below a given edge. For example, weights 
may be used to specify the probability that a recombination takes place at that edge, while 
linkage compatibility or affinity tags may be used to identify homologous or compatible 
genes, specifying the possibility or probability of swapping two sub-trees. Other 
annotations, like context labels, variable type, etc., may provide additional information 
about the possibility or probability of recombination or crossover, the need of type-cast 
operations, etc. When such metacontrol annotations coevolve in the stochastic optimization 
process, they may be interpreted as a spontaneously emergent semantics. Any semantic 
information may, in turn, be used in the design of acceleration procedures based on 
heuristic merit functions, like the example studied in section 5.2.3. 

Banzhaf (1998, ch.6, p. 164) gives a simple example of functional tree annotation: 

"Recently, we introduced the explicitly defined introns (EDI) into GP. An 
integer value is stored between every two nodes in the GP individual. This 
integer value is referred as the EDI value (EDIV). The crossover operator is 
changed so that the probability that crossover occurs between any two nodes in 
the GP program is proportional to the integer value between the nodes. That 
is, the EDIV integer value strongly influences the crossover sites chosen by the 
modified GP algorithm, Nordin et al. (1996). 

The idea behind EDIVs was to allow the EDIV vector to evolve during 
the GP run to identify the building blocks in the individual as an emergent 
phenomenon. Nature may have managed to identify genes and to protect them 
against crossover in a similar manner. Perhaps if we gave the GP algorithm 
the tools to do the same thing, GP, too, would learn how to identify and protect 
the building blocks. If so, we would predict that the EDIV values within a good 
building block should become low and, outside the good block, high. " 
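
The selection rule described in this quotation, crossover probability proportional to the integer stored between two nodes, can be sketched in a few lines. This is only an illustration on a linear genome, under the assumption that `ediv[i]` is the value stored between nodes `i` and `i + 1`; it is not the implementation of Nordin et al. (1996), and the function name is hypothetical.

```python
import random

def choose_crossover_site(ediv, rng=random):
    """Pick an edge index with probability proportional to its EDIV.

    ediv[i] is the non-negative integer stored between node i and node i + 1
    (an assumed linear layout). rng may be the random module or a
    random.Random instance.
    """
    if sum(ediv) <= 0:
        raise ValueError("at least one EDIV must be positive")
    return rng.choices(range(len(ediv)), weights=ediv, k=1)[0]

# Edges 1..3 lie inside a protected block (low EDIV); edges 0 and 4 lie outside.
ediv = [8, 1, 1, 1, 8]
rng = random.Random(0)
counts = [0] * len(ediv)
for _ in range(10000):
    counts[choose_crossover_site(ediv, rng)] += 1
```

With these weights, each cut inside the block is chosen with probability 1/19, against 8/19 for each of the two outer edges, so a block with low internal values is seldom broken by crossover.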

Let us finish this section presenting two interpretations for the role of modularity in 
genetic evolutionary processes. These interpretations are common in biology, computer 
science and engineering, an indication that they provide powerful insights. These two 
metaphors are commonly referred to as: 

- New technology dissemination or component design substitution, and 

- Damage control or repair mechanism. 

The first interpretation is perhaps the more evident. In a modular system, a new design 
for an old component can be easily incorporated and, if successful, be rapidly disseminated. 
A classical example is the replacement of mechanical carburetors by electronic injection 
as the standard technology for this component of gasoline engines in the automotive 
industry. The large assortment of upgrade kits available in any automotive or computer 
store gives strong evidence of how much these industries rely on modular design. The 
second interpretation explains the possibility for the "continued evolution of germlines 
otherwise destined to extinction", see Michod and Levin (1988). A classic illustration 
related to the damage control and repair mechanisms offered by modular organization is 
given by the Hora and Tempus parable of Simon (1996), presented in section 6.4. 

The lessons learned in this section may be captured by the following dicta of Herbert 
Simon: 

"The time required for the evolution of a complex form from simple elements 
depends critically on the number and distribution of potential intermediate 
stable subassemblies." Simon (1996, p. 190). 

"Hierarchy, I shall argue, is one of the central structural schemes that the 
architect of complexity uses." Simon (1996, p. 184). 

5.5 Evolution of Theories 

The last sections presented a general framework for the stochastic evolution of complex 
systems. Figure 6 presents a systemic diagram of biological production, according to this 
framework. This diagram is also compatible with the current biological theories of life 
evolution, provided it is considered as a schematic simplification focusing on our particular 
interests. 

The comparison of this biological production diagram with the scientific production 
diagram presented in section 1.5 motivates several analogies which may receive further 
encouragement from a comment by Davis and Steenstrup (1987, p. 2): 

"The metaphor underlying genetic algorithms is that of natural evolution. 
In evolution, the problem each species faces is one of searching for beneficial 
adaptations to a complicated and changing environment. The 'knowledge' that 
each species has gained is embodied in the makeup of the chromosomes of its 
members." 

According to this view, computational (or biological genetic) programs are perceived 
as coded knowledge acquired by a population. An immediate generalization of this idea is 
to consider the evolution of other corpora of knowledge, embodied in a variety of media. 
Our main interest, given the scope of this book, is in the evolution of scientific theories 
and their supporting statistical models. This is the topic discussed in this and the next 
sections. For some very interesting qualitative analyses related to this subject see Richards 
(1989, appendix II) and Lakatos (1978a,b). 

Section 5.1 considers several ways in which statistical models can be nested, mixed and 
separated. It also analyzes the series-parallel composition of several simpler and (nearly) 
independent models. Section 5.2 is devoted to complementary models. Complementarity 
is a basic form of model composition in quantum mechanics that has received, so far, 
little attention in other application areas. All these forms of model transformation and 
combination should provide a basic set of mutations and recombination operators in an 
abstract modeling space. In this section we focus on the statistical operations themselves, 
leaving some of the required epistemological analyses and historical comments to sections 
6 and 7. 

Figure 6: Biological production diagram. 



5.5.1 Nested, Mixture, Separate and Series-Parallel Models 

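In this subsection we use some examples involving the (two-parameter) Weibull (W2) and 
Gompertz (G2) probability models. The hazard (or failure rate) functions, $h_{W2}$ and $h_{G2}$, 
the reliability (or survival) functions, $r_{W2}$ and $r_{G2}$, and the density functions, $f_{W2}$ and $f_{G2}$, 
of these models are given by: 

$$ h_{W2}(x \mid \beta, \gamma) = \frac{\beta}{\gamma^\beta} x^{\beta-1} ; \quad r_{W2} = \exp\left( -\left( \frac{x}{\gamma} \right)^\beta \right) ; \quad f_{W2} = \frac{\beta}{\gamma^\beta} x^{\beta-1} \exp\left( -\left( \frac{x}{\gamma} \right)^\beta \right) ; $$ 

$$ h_{G2}(x \mid \alpha, \lambda) = \lambda \alpha^x ; \quad r_{G2} = \exp\left( -\frac{\lambda}{\log \alpha} (\alpha^x - 1) \right) ; \quad \mbox{and} \quad f_{G2} = \lambda \alpha^x \exp\left( -\frac{\lambda}{\log \alpha} (\alpha^x - 1) \right) . $$ 

The parameters $\gamma$ and $\beta$ for the Weibull model, and $\lambda$ and $\alpha$ for the Gompertz model, 
are known, respectively, as the scale and shape parameters. Notice that $h = f/r$, and 
$r = 1 - F$, that is, the reliability function is the complement of the cumulative distribution 
function $F$. 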
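
The relations $h = f/r$ and $f = -dr/dx$ offer a quick numerical check of these formulas. The sketch below codes the three functions of each model independently and leaves the comparison to the reader; the function names and the sample parameter values are assumptions chosen only for illustration.

```python
import math

def weibull_h(x, beta, gamma):
    """Weibull hazard: (beta / gamma^beta) x^(beta - 1)."""
    return (beta / gamma ** beta) * x ** (beta - 1)

def weibull_r(x, beta, gamma):
    """Weibull reliability: exp(-(x / gamma)^beta)."""
    return math.exp(-((x / gamma) ** beta))

def weibull_f(x, beta, gamma):
    """Weibull density, written out explicitly."""
    return (beta / gamma ** beta) * x ** (beta - 1) * math.exp(-((x / gamma) ** beta))

def gompertz_h(x, alpha, lam):
    """Gompertz hazard: lambda alpha^x."""
    return lam * alpha ** x

def gompertz_r(x, alpha, lam):
    """Gompertz reliability: exp(-(lambda / log alpha)(alpha^x - 1))."""
    return math.exp(-(lam / math.log(alpha)) * (alpha ** x - 1))

def gompertz_f(x, alpha, lam):
    """Gompertz density, written out explicitly."""
    return lam * alpha ** x * math.exp(-(lam / math.log(alpha)) * (alpha ** x - 1))

def minus_derivative(r, x, eps=1e-6):
    """Central finite-difference estimate of -dr/dx."""
    return (r(x - eps) - r(x + eps)) / (2 * eps)

# Sample check at arbitrary (assumed) parameter values.
x, beta, gamma = 1.3, 1.5, 2.0
ratio = weibull_f(x, beta, gamma) / weibull_r(x, beta, gamma)  # should equal h
slope = minus_derivative(lambda t: weibull_r(t, beta, gamma), x)  # should equal f
```

Evaluating `weibull_r(gamma, beta, gamma)` gives $e^{-1} \approx 0.368$ for any shape $\beta$, that is, about 63% of the units have failed by time $\gamma$.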

These probability models are used in reliability theory to study the characteristics of 
the survival (or life) time of a system, until it first fails (or dies). It can be shown, see 
Barlow and Proschan (1981), Gavrilov (1991, 2001) and appendix H.3, that the Weibull 
distribution is adequate to describe the survival time of many allopoietic, manufactured or 
industrial systems, while the Gompertz distribution is adequate to describe the life time of 
many autopoietic, biological or organic systems. In this setting, the key difference between 
autopoietic and allopoietic systems is the nature of their ontogenesis or assembling process, 
as described in the next paragraphs. Reasonable assumptions concerning the systems' 
ontogenesis will render either the Weibull or the Gompertz distributions as asymptotic 
eigen-solutions. 

The assemblage of allopoietic systems is assumed to be subject to rigid protocols 
of quality control, to assure that parts and components as well as the final product, 
work properly. The good quality of the parts and components allows the use of efficient 
projects and streamlined designs, with little redundancy and practically no waste. Project 
optimization provides the designer of such products the means to minimize the use of 
space, material resources, and even assembling time. The lack of redundancy, however, 
implies that the failure of just a few or even only one small component can disable the 
system. 

In contrast, an autopoietic system is assumed to be self-assembled. The very nature 
of organic ontogenesis does not allow for strict quality control. For example, in embryonic 
or fetal development there is not an opportunity to check individual cells, nor is there a 
mechanism for replacing defective ones. The viability of autopoietic systems relies not on 
quality control of individual components, but on massive redundancy and parallelism. 

Let us now examine more closely some of the details of these statistical models. In so 
doing we will also be able to explain several modes of model composition. 

In the Weibull model, the scale parameter, $\gamma$, is approximately the 63rd lifetime 
percentile, regardless of the value of the shape parameter. By altering its shape parameter, 
$\beta$, the (two-parameter) Weibull distribution can take a variety of forms, see Figure 7 and 
Dodson (1994). Some particular values of the shape parameter are important special cases: 
for $\beta = 1$, it is the exponential distribution; for $\beta = 2$, it is the Rayleigh distribution; 
for $\beta = 2.5$, it approximates the lognormal distribution; for $\beta = 3.6$, it approximates the 
normal distribution; and for $\beta = 5.0$, it approximates the peaked normal distribution. 
The flexibility of the Weibull distribution makes it very useful for empirical modeling, 
especially in quality control and reliability. The regions $\beta < 1$, $\beta = 1$, and $\beta > 1$ 
correspond to decreasing, constant and increasing hazard rates. These three regions are also 
known as infant mortality, memoryless, and wearout. In the limit case $\beta = 1$, the Weibull 
degenerates into the Exponential distribution. This (no) aging regime represents a simple 
element with no structure exhibiting, therefore, the memoryless property of constant 
failure rate, $h(x \mid \gamma) = 1/\gamma$. 

The affine transformation $x = x' + \alpha$ leads to the (three-parameter) Truncated Weibull 
distribution. A location (or threshold) parameter, $\alpha > 0$, represents beginning observation 
of a Truncated Weibull variate at $t = 0$, after it has already survived the period $[-\alpha, 0[$. 
For the sake of comparison, the reliability functions of the (one-parameter) Exponential, 
and the two and three-parameter Weibull distributions are given next: 

$$ \exp\left( -\frac{x}{\gamma} \right) , \quad \exp\left( -\left( \frac{x}{\gamma} \right)^\beta \right) , \quad \exp\left( -\left( \frac{x + \alpha}{\gamma} \right)^\beta \right) \Big/ \exp\left( -\left( \frac{\alpha}{\gamma} \right)^\beta \right) . $$ 

Figure 7: Shapes of the Weibull Distribution, $h$, $r$ and $f$. 
Parameters: $\gamma = 1$, $\beta = 0.5, 1.0, 1.5, 2.0, 2.5, 3.6, 5.0$. 



In the example at hand we have three nested models, in which a distribution with 
fewer parameters (or degrees of freedom) is a special case (a sub-manifold in the 
parameter space) of a distribution with more parameters (or degrees of freedom): The 
(one-parameter) Exponential distribution is a special case of the (two-parameter) Weibull 
distribution which, in turn, is a special case of the (three-parameter) Truncated Weibull 
distribution. Nesting is one of the basic modes of relating different statistical models. For 
examples of the FBST used for model selection in nested models see Ironi et al. (2002), 
Lauretto et al. (2003), Stern and Zacks (2002). 

The (two-parameter) Weibull distribution also has an important theoretical property: 
Its functional form is invariant by serial composition. If $n$ i.i.d. random variables have 
Weibull distribution, $X_i \sim f(x \mid \beta, \gamma)$, then the first failure is a Weibull variate with 
characteristic life $\gamma / n^{1/\beta}$, i.e. $X_{[1,n]} \sim f(x \mid \beta, \gamma / n^{1/\beta})$. This is a key property for its 
characterization as a stable distribution, that is, for the characterization of the Weibull 
distribution as an (asymptotic) eigensolution. For applications in the context of extreme 
value theory, see Barlow and Proschan (1981). 
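
This invariance follows in one line, since the first failure among $n$ independent components survives time $x$ only if all of them do: $r(x)^n = \exp(-n (x/\gamma)^\beta) = \exp(-(x / (\gamma n^{-1/\beta}))^\beta)$. The short sketch below checks the identity numerically; the function names and parameter values are assumptions for illustration.

```python
import math

def weibull_r(x, beta, gamma):
    """Weibull reliability: exp(-(x / gamma)^beta)."""
    return math.exp(-((x / gamma) ** beta))

def first_failure_r(x, beta, gamma, n):
    """Reliability of the first failure among n i.i.d. Weibull components."""
    return weibull_r(x, beta, gamma) ** n

def first_failure_scale(beta, gamma, n):
    """Characteristic life of the first failure: gamma / n^(1/beta)."""
    return gamma / n ** (1.0 / beta)

# Compare both sides of the identity on a few points (assumed parameters).
beta, gamma, n = 2.5, 3.0, 7
pairs = [(first_failure_r(x, beta, gamma, n),
          weibull_r(x, beta, first_failure_scale(beta, gamma, n)))
         for x in (0.1, 0.5, 1.0, 2.0)]
```

The two entries of each pair agree up to rounding error, and the shape parameter $\beta$ is unchanged by the composition; only the scale shrinks as $n$ grows.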

While a series system fails when its first element fails, a parallel system fails when 
its last element fails. Figure 8 gives the standard graphical representation of series and 
parallel systems. This representation is inspired by circuit theory: While in a series system 
the current flow is cut if a single element is cut, in a parallel system the current flow is 
cut only if all elements are cut. Series and parallel composition are the two basic modes 
used in Reliability Engineering for structuring and analyzing complex systems. Some of 
the statistical properties of these structures are captured in the form of algebraic lattices, 
see Barlow and Proschan (1981) and Kaufmann et al. (1977). Some of these properties 
are similar or analogous to the compositional rules analyzed in section A.4 and Borges and 
Stern (2007). The characterization of the Gompertz as a limit distribution for parallel 
systems is given in appendix H.3, following Gavrilov (1991). 

Figure 8: Series and Parallel Systems. 
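
For components that fail independently (an assumption of this sketch, not a claim about any particular system), the two composition rules reduce to products of probabilities: a series system survives only if every element survives, and a parallel system fails only if every element fails. The function names are hypothetical.

```python
from math import prod

def series_reliability(rs):
    """Reliability of a series system with independent element reliabilities rs."""
    return prod(rs)

def parallel_reliability(rs):
    """Reliability of a parallel system with independent element reliabilities rs."""
    return 1.0 - prod(1.0 - r for r in rs)

# Two elements, each surviving the mission time with probability 0.9.
elements = [0.9, 0.9]
r_series = series_reliability(elements)      # 0.9 * 0.9
r_parallel = parallel_reliability(elements)  # 1 - 0.1 * 0.1
```

With two elements of reliability 0.9, the series arrangement has reliability 0.81 and the parallel arrangement 0.99, which illustrates the contrast between streamlined designs and redundant ones.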

Imagine a situation in which a scientist receives a data bank of observed individual 
lifetimes in a population. The scientist also knows that all individuals in the population 
are of the same nature, that is, the population is either entirely allopoietic, $H_1$, or 
autopoietic, $H_2$. Since hypotheses $H_1$ and $H_2$ imply life distributions with distinct functional 
forms, the scientist could use his/her observed life data to decide which hypothesis is 
correct (or more adequate). This situation is known in statistics as the problem of separate 
hypotheses. The scientist could also be faced with a mixed population, a situation in which 
a fraction $w_1$ of the individuals are allopoietic, and a fraction $w_2$ of the individuals are 
autopoietic. In this situation the scientist could use his/her observed data to infer the 
fractions or weights, $w_1$ and $w_2$, in the mixture model. 

For mixture models in general, the p.d.f. of the data is a convex linear combination 
of fixed candidate densities. Writing the model's vector parameter as $\theta = [w, \psi_1, \ldots, \psi_m]$, 

$$ f(x \mid \theta) = w_1 f_1(x \mid \psi_1) + \ldots + w_m f_m(x \mid \psi_m) , \quad w \geq 0 \mid w'\mathbf{1} = 1 , $$ 

and the model's likelihood function is 

$$ f(X \mid \theta) = \prod_{j=1}^{n} \sum_{k=1}^{m} w_k f_k(x_j \mid \psi_k) . $$ 






Figure 9a,b: Mixture models with 1 and 2 bivariate-Normal components. 



In mixture analysis for unsupervised classification, we assume that the data comes from 
two or more subpopulations (classes), distributed under distinct densities. Statistical 
mixture models may also be able to infer the classification probabilities for each data 
point, see Figure 9. In a heterogeneous mixture model, the components in the mixture 
have distinct functional forms. In a homogeneous mixture model, all components in the 
mixture have the same functional form. For several applications of these models, see 
Fraley (1999), Lauretto et al. (2006, 2007), Robert (1996) and Stephens (1997). 
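
The mixture density, its log-likelihood and the classification probabilities of a data point can be sketched as follows. For brevity the sketch uses univariate Normal components rather than the bivariate ones of Figure 9; this choice, the function names and the sample parameters are all assumptions made for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, w, params):
    """f(x | theta) = sum_k w_k f_k(x | psi_k), with psi_k = (mu_k, sigma_k)."""
    return sum(wk * normal_pdf(x, mu, s) for wk, (mu, s) in zip(w, params))

def log_likelihood(xs, w, params):
    """Logarithm of prod_j sum_k w_k f_k(x_j | psi_k)."""
    return sum(math.log(mixture_pdf(x, w, params)) for x in xs)

def class_probabilities(x, w, params):
    """Probability that x came from each component, given the parameters."""
    parts = [wk * normal_pdf(x, mu, s) for wk, (mu, s) in zip(w, params)]
    total = sum(parts)
    return [p / total for p in parts]

# A homogeneous mixture of two Normal components (assumed parameter values).
w = [0.5, 0.5]
params = [(-1.0, 1.0), (1.0, 1.0)]
probs_mid = class_probabilities(0.0, w, params)   # point halfway between means
probs_left = class_probabilities(-2.0, w, params)  # point closer to component 1
```

A point halfway between the two means is assigned to either class with equal probability, while points nearer to one mean are assigned mostly to that component.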



5.5.2 Complementary Models 



According to Bohr, the word Complementarity is used 



"...to characterize the relationship between experiences obtained by differ- 
ent experimental arrangements and visualizable only by mutually exclusive 
ideas...". (N.Bohr II, Natural Philosophy and Human Cultures, p. 30) 

"Information regarding the behavior of an object obtained under definite 
experimental conditions may, however, ...be adequately characterized as com- 
plementary to any information about the same object obtained by some other 
experimental arrangement excluding the fulfillment of the first conditions. Al- 
though such kinds of information cannot be combined into a single picture by 
means of ordinary concepts, they represent indeed equally essential aspects of 
any knowledge of the object in question which can be obtained in this domain." 
(Bohr 1938, p.26). 

In Quantum Mechanics, at least from a historical perspective, the most important 
complementarity relations are those implied by the wave-particle complementarity or duality 
principle. We have mentioned these complementarity relations in section 3.3, and we will 
examine them again in sections 6 and 7. This principle states that microparticles 
exhibit the properties of both particles and waves, even considering that, in classical physics, 
these categories are mutually exclusive. At the dawn of the XX century, physics had an 
assortment of phenomena that could not be appropriately explained by classical physics. 
In order to explain one of these phenomena, known as the photoelectric effect, Albert 
Einstein postulated in 1905, his annus mirabilis, a model in which light, conceived in 
classical physics as electro-magnetic waves, should also be seen as a rain of tiny particles, now 
called photons. Einstein's basic hypothesis was that a photon's energy is proportional to 
the light's frequency, $E = h\nu$, where the proportionality constant, $h$, is Planck's constant. 

In 1924, Louis de Broglie generalized Einstein's hypotheses. Using Einstein's relativistic 
relation, $E = mc^2$, the photon's wavelength, $\lambda = c/\nu$, can be written as $\lambda = h/(mc)$, 
where $m = E/c^2$ is the effective mass attributed to the photon. A moving particle's 
momentum is defined as the product of its mass and velocity, $p = mv$. Hence, de Broglie 
conjectured that any moving particle has associated to itself a "pilot wave" of wavelength 
$\lambda = h/p = h/(mv)$, see Broglie (1946, ch.IV, Wave Mechanics) for the original argument. 
Just two years later, in 1926, Erwin Schrödinger published the paper "Quantization as 
an Eigenvalue Problem", further generalizing these ideas into his (Schrödinger's) wave 
equation, the basis for a general theory of Quantum Mechanics, see next section. The 
details of the early developments of Quantum Mechanics can be found in Tomonaga (1962) 
and Pais (1988, ch.12), but from this brief history it is clear that the general idea of 
complementarity was a cornerstone in the birth of modern physics. 
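
The chain of substitutions in de Broglie's argument, $\lambda = c/\nu = hc/E = hc/(mc^2) = h/(mc)$, can be checked numerically. The sketch below uses SI values for $h$, $c$ and the electron mass; the function names, the sample frequency and the sample electron speed are assumptions chosen only for illustration.

```python
import math

H = 6.62607015e-34            # Planck's constant, J s
C = 299792458.0               # speed of light in vacuum, m/s
ELECTRON_MASS = 9.1093837015e-31  # kg

def photon_wavelength(nu):
    """Wavelength of light of frequency nu: lambda = c / nu."""
    return C / nu

def photon_effective_mass(nu):
    """Effective mass m = E / c^2, with Einstein's hypothesis E = h nu."""
    return H * nu / C ** 2

def de_broglie_wavelength(m, v):
    """Pilot wave wavelength lambda = h / p = h / (m v)."""
    return H / (m * v)

nu = 5.0e14  # a frequency in the visible range (assumed example value)
lam_photon = photon_wavelength(nu)
lam_from_mass = de_broglie_wavelength(photon_effective_mass(nu), C)

# The same formula applied to a slow (non-relativistic) electron.
lam_electron = de_broglie_wavelength(ELECTRON_MASS, 1.0e6)
```

The two photon wavelengths coincide, as the derivation requires, and the electron moving at $10^6$ m/s has a wavelength below one nanometer, comparable to atomic dimensions.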

Nevertheless Bohr believed that complementarity could be a useful concept in many 
other areas. Folse (1985) gives an interesting essay about Bohr's ideas on complementarity, 
including its application to fields outside quantum mechanics. Possible examples of such 
applications are given next: 

"...the lesson with respect to the role which the tools of observation play in 
defining the elementary physical concepts gives a clue to the logical applications 
of notions like purposiveness foreign to physics, but lending themselves 
so readily to the description of organic phenomena. Indeed, on this background 
it is evident that the attitudes termed mechanistic and finalistic do not present 
contradictory views on biological problems, but rather stress the mutually 
exhaustive observational conditions equally indispensable in our search for an 
ever richer description of life." (Bohr II, Physical Science and Problems of 
Life, p. 100). 

"For describing our mental activity, we require, on one hand, an 
objectively given content to be placed in opposition to a perceiving subject, while, on 
the other hand, as is already implied in such an assertion, no sharp separa- 
tion between object and subject can be maintained, since the perceiving subject 
also belongs to our mental content. From these circumstances follows not only 
the relative meaning of every concept, or rather of every word, the meaning 
depending upon our arbitrary choice of view point, but also we must, in gen- 
eral, be prepared to accept the fact that a complete elucidation of one and the 
same object may require diverse points of view which defy a unique descrip- 
tion. Indeed, strictly speaking, the conscious analysis of any concept stands in 
a relation of exclusion to its immediate application. The necessity of taking 
recourse to a complementarity, or reciprocal, mode of description is perhaps 
most familiar to us from psychological problems. In opposition to this, the fea- 
ture which characterizes the so-called exact sciences is, in general, the attempt 
to attain to uniqueness by avoiding all reference to the perceiving subject. This 
endeavor is found most consciously, perhaps, in the mathematical symbolism 
which sets up for our contemplation an ideal of objectivity to the attainment of 
which scarcely any limits are set, so long as we remain within a self-contained 
field of applied logic. In the natural sciences proper, however, there can be no 
question of a strictly self-contained field of application of the logical principles, 
since we must continually count on the appearance of new facts, the inclusion 
of which within the compass of our earlier experience may require a revision 
of our fundamental concepts." (Bohr I, The Quantum of Action, p. 96-97). 

Examining some basic concepts of quantum mechanics, L.V.Tarasov (1980, p. 153) 
poses a question concerning the concept of complementarity that is very pertinent in our 
context: 

"A microparticle is neither a corpuscle, nor a wave, but still we employ 
both these images, which mutually exclude each other, for describing a mi- 
croparticle. ... Naturally, this could give rise to a ticklish question: Doesn't 
this mean an alienation of the image from the object, which is fraught with 
a transition to the position of subjectivism? A negative answer to this ques- 
tion is given by the principle of complementarity itself. From the position of 
this principle, pictures mutually excluding one another are used as mutually 
complementary pictures, adequately representing various sides of the objective 
reality called the microparticle. " 

Even considering that Tarasov makes his point from a very different epistemological 
perspective, his statement fits admirably well into our constructivist framework. Within it 
the objectivity of a complementarity model can be interpreted as follows: Although 
complementary, the several views employed to describe an object should still render objective 
(epistemic) eigensolutions. As always, the whole model will be considered as objective as 
these well characterized eigensolutions, that is, sharp, stable, separable and composable, 
as examined in detail in section 3.5. Of course, the compositionality rules for a given 
theory or model must be given by an appropriate formalism. Such a formalism must include 
a full specification of compatibility / incompatibility rules for axioms or statements in the 
theory or model. For an example of this kind of formalism, see Costa and Krause (2004). 

5.6 Varieties of Probability 

This section presents some basic ideas of Quantum Mechanics, providing simple heuristic 
derivations for a few of its basic principles. Its main objective is to discuss the impact of 
Quantum Mechanics on the concept and interpretation of probability models. 

5.6.1 Heisenberg's Uncertainty Principle 

In this section we present Werner Heisenberg's uncertainty principle, derived directly from 
de Broglie's wave-particle complementarity principle. 

A particle with a precise momentum, $p$, has associated to it a pilot wave that is monochromatic, 
that is, has a single wavelength, $\lambda$. Hence, this wave is homogeneously distributed 
in space. Let us think of a particle with an uncertain momentum, specified by a probability 
distribution, $\phi(p)$. What would the distribution, $\psi(x)$, of the location of its associated 
pilot wave, be? Assuming that the composition rule for pilot waves is the standard linear 
superposition principle, see Section 4.2, the answer to this question is given by the 
mathematics of Fourier series and transforms, see Butkov (1968, ch.4 and 7), Byron and Fuller 
(1969, ch.4 and 5) or Sadun (2001, ch.8 and 10). 

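The Fourier synthesis of a function, $f(x)$, in the interval $[0, L]$ is given by the Fourier 
series 

$$ f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left[ a_n \cos\left( \frac{2\pi n x}{L} \right) + b_n \sin\left( \frac{2\pi n x}{L} \right) \right] , $$

$$ a_n = \frac{2}{L} \int_0^L f(x) \cos\left( \frac{2\pi n x}{L} \right) dx , \quad b_n = \frac{2}{L} \int_0^L f(x) \sin\left( \frac{2\pi n x}{L} \right) dx . $$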

The following examples give the Fourier series for the rectangular and triangular spike 
functions, $R_{2h}(x)$ and $T_{2h}(x)$. In order to obtain simpler expressions, the spikes are 
presented at the center of the interval $[-\pi, +\pi]$, the standard interval of length $L = 2\pi$ shifted 
to be centered at the origin. Figure 10a displays the first 5 even harmonics, $\cos(nx)$, for 
wave number $n = 1 \ldots 5$. Figure 10b displays the Fourier coefficients, $a_n$, in the synthesis 
of the triangular spike $T_{2h}(x)$, for $h = 1.0$. Figures 10c and 10d display the triangular 
spike and its Fourier syntheses with the first 2 and the first 5 harmonics. 

Figure 10: Monochromatic Waves and Superposition Packets. 



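$$ R_{2h}(x) = \begin{cases} 1 , & \text{if } -h < x < +h , \\ 0 , & \text{otherwise in } [-\pi, \pi] \end{cases} \; = \; \frac{2h}{2\pi} + \frac{2}{\pi} \sum_{n=1}^{\infty} \frac{\sin(nh)}{n} \cos(nx) , $$

$$ T_{2h}(x) = \begin{cases} 1 - |x|/h , & \text{if } |x| < h , \\ 0 , & \text{otherwise in } [-\pi, \pi] \end{cases} \; = \; \frac{h}{2\pi} + \frac{4h}{\pi} \sum_{n=1}^{\infty} \left( \frac{\sin(nh/2)}{nh} \right)^2 \cos(nx) . $$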
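
The partial Fourier syntheses of the triangular spike can be reproduced numerically. The following is a minimal sketch, not the code used to draw Figure 10: it assumes a spike of unit height and half-width $h$ on $[-\pi, \pi]$, and the function names are illustrative. The closed form used for $a_n$ follows from integrating the spike against $\cos(nx)$ and should be checked against the series displayed in the text.

```python
import math


def triangular_spike(x, h):
    """T_2h(x): height 1 at x = 0, falling linearly to 0 at |x| = h."""
    return max(0.0, 1.0 - abs(x) / h)


def cosine_coefficient(n, h):
    """Cosine coefficient a_n of T_2h on [-pi, pi] (all b_n vanish by symmetry)."""
    if n == 0:
        return h / math.pi  # a_0 is twice the mean value of the spike
    u = n * h / 2.0
    return (h / math.pi) * (math.sin(u) / u) ** 2


def partial_synthesis(x, h, n_max):
    """a_0 / 2 plus the first n_max cosine harmonics, evaluated at x."""
    total = cosine_coefficient(0, h) / 2.0
    for n in range(1, n_max + 1):
        total += cosine_coefficient(n, h) * math.cos(n * x)
    return total


if __name__ == "__main__":
    h = 1.0
    for n_max in (2, 5, 200):
        print(n_max, partial_synthesis(0.0, h, n_max), triangular_spike(0.0, h))
```

Since every $a_n$ is non-negative, the value of the synthesis at the peak, $x = 0$, can only increase as harmonics are added, which is why the 5-harmonic packet is visibly sharper than the 2-harmonic one.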

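It is also possible to express the Fourier series in complex form. Using the complex 
exponential notation, $\exp(ix) = \cos(x) + i \sin(x)$, we write 

$$ f(x) = \sum_{n=-\infty}^{+\infty} c_n \exp\left( i n 2\pi x / L \right) , \quad c_n = \frac{1}{L} \int_0^L f(x) \exp\left( -i n 2\pi x / L \right) dx . $$

The trigonometric and complex exponential Fourier coefficients are related as follows 

$$ c_0 = \frac{1}{2} a_0 , \quad c_n = \frac{1}{2} (a_n - i b_n) , \quad c_{-n} = \frac{1}{2} (a_n + i b_n) , \quad n = 1, \ldots, \infty . $$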
The complex form is more symmetric and elegant. In particular, the orthogonality relations, 

$$ \int_0^L e^{i n 2\pi x / L} \, e^{-i m 2\pi x / L} \, dx = \int_0^L e^{i (n-m) 2\pi x / L} \, dx = \begin{cases} L , & \text{if } n = m , \\ 0 , & \text{if } n \neq m , \end{cases} $$

are the key for interpreting the set of complex exponentials, $\{ e^{i n 2\pi x / L} \}$, for wave numbers 
$n = -\infty \ldots +\infty$, as an orthogonal basis for the appropriate functional vector space in 
the interval $[0, L]$. 
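
The orthogonality relations are easy to check numerically. The sketch below is only an illustration: the function name, the midpoint quadrature rule and the number of grid points are assumptions. It approximates the integral of $e^{i n 2\pi x / L} e^{-i m 2\pi x / L}$ over $[0, L]$.

```python
import cmath
import math


def inner_product(n, m, length=2.0 * math.pi, steps=2000):
    """Midpoint-rule approximation of the integral over [0, length] of
    exp(i n 2 pi x / length) * exp(-i m 2 pi x / length)."""
    dx = length / steps
    total = 0j
    for j in range(steps):
        x = (j + 0.5) * dx
        total += cmath.exp(1j * (n - m) * 2.0 * math.pi * x / length)
    return total * dx


if __name__ == "__main__":
    print(abs(inner_product(3, 3)))  # close to the interval length
    print(abs(inner_product(3, 5)))  # close to zero
```

For equal wave numbers the result is the interval length, $L$, and for distinct wave numbers it vanishes up to rounding error, as stated by the relations above.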

If we want to synthesize functions in the entire real line, not just in a finite interval, 
we must replace Fourier series by Fourier transforms. The Fourier transform, $\hat{f}(k)$, of a 
function, $f(x)$, and its inverse transform are defined, respectively, by 

$$ \hat{f}(k) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} f(x) \exp(-ikx) \, dx \quad \text{and} \quad f(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \hat{f}(k) \exp(ikx) \, dk . $$


In the Fourier transform the propagation number (or angular frequency), $k = 2\pi n / L$, 
replaces the wave number, n, used in the Fourier series. The new normalization con- 
stants are defined to stress the duality between the complementary representations of the 
function in state and frequency spaces, x and k. 

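As an important example, let us compute the Fourier transform of a Gaussian distribution 
with mean $\mu = 0$ and standard deviation (uncertainty) $\sigma_x = \sigma$: 

$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{x^2}{2\sigma^2} \right) , \quad \hat{f}(k) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} f(x) \exp(-ikx) \, dx = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{\sigma^2 k^2}{2} \right) . $$
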

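This computation can be checked using the analytic formula of the Gaussian integral, 

$$ \int_{-\infty}^{+\infty} \exp\left( -a x^2 + b x + c \right) dx = \sqrt{\frac{\pi}{a}} \, \exp\left( \frac{b^2}{4a} + c \right) . $$
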

Figure 11: Uncertainty Relation for Fourier Conjugates. 
Hence, the Fourier transform of a Gaussian distribution with standard deviation $\sigma_x = \sigma$ 
is again a Gaussian distribution, but with standard deviation $\sigma_k = 1/\sigma$, that is, 
$\sigma_x \sigma_k = 1$. Figure 11 displays the case $\sigma = 1.5$. It is also possible to show that this example 
is a best case, in the sense that, for any other function, $f(x)$, the standard deviations of 
the conjugate functions, $f(x)$ and $\hat{f}(k)$, obey the inequality of the uncertainty principle, 
$\sigma_x \sigma_k \geq 1$, see Sadun (2001, sec.10.5). 
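
The reciprocal relation between the two widths can be verified by brute-force numerical integration. The sketch below is only an illustration under stated assumptions: a zero-mean Gaussian, the symmetric $1/\sqrt{2\pi}$ normalization of the transform, a trapezoidal rule on a truncated interval, and hypothetical function names. The standard deviation of the transform is computed by treating it as an unnormalized profile in $k$.

```python
import math


def gaussian(x, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def fourier_transform(f, k, x_max, steps=2000):
    """Trapezoidal approximation of (1/sqrt(2 pi)) * integral of f(x) exp(-ikx) dx
    over the truncated interval [-x_max, x_max]."""
    dx = 2.0 * x_max / steps
    total = 0j
    for j in range(steps + 1):
        x = -x_max + j * dx
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * f(x) * complex(math.cos(k * x), -math.sin(k * x))
    return total * dx / math.sqrt(2.0 * math.pi)


def transform_std(sigma, k_steps=200):
    """Standard deviation of the transform of gaussian(., sigma), read as an
    unnormalized profile in k centered at the origin."""
    x_max = 10.0 * sigma  # integration window in x
    k_max = 9.0 / sigma   # window in k, wide enough for the expected width
    dk = 2.0 * k_max / k_steps
    mass = 0.0
    second_moment = 0.0
    for j in range(k_steps + 1):
        k = -k_max + j * dk
        value = fourier_transform(lambda x: gaussian(x, sigma), k, x_max).real
        mass += value
        second_moment += k * k * value
    return math.sqrt(second_moment / mass)


sigma_x = 1.5
sigma_k = transform_std(sigma_x)
print(sigma_x, sigma_k, sigma_x * sigma_k)
```

For $\sigma = 1.5$, the case of Figure 11, the computed product of the two standard deviations should be very close to 1.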

In the context of Quantum Mechanics, the best known instance of the uncertainty 
principle gives a lower bound on the product of the standard deviations of the position 
and momentum of a particle, 

$$ \sigma_x \sigma_p \geq \frac{\hbar}{2} , \quad \hbar = \frac{h}{2\pi} , \quad h = 6.62606896(33)\mathrm{E}{-34} \ \mathrm{J\,s} . $$

Heisenberg's bound is written as a function of the momentum, $p$, instead of the frequency, 
$k$; this is why in the right hand side of the inequality we have half the reduced Planck's 
constant, $\hbar/2$, instead of 1, as in Fourier transform conjugate functions. 

The dimension of Planck's constant is that of action, an energy-time product, like joule-second 
or electron-volt-second. The values above present the best current (2006) estimates for this 
fundamental physical constant, in the format recommended by the Committee on Data 
for Science and Technology, CODATA. The two digits in parentheses denote the standard 
deviation of the last two significant digits of the constant's value. The importance of this 
constant and its representation are further analyzed in the next sections. 
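
The concise notation described above can be unpacked mechanically. The following sketch assumes the value is written as `6.62606896(33)E-34`; the parser name and the accepted pattern are illustrative, not part of any CODATA tool. It returns the estimate and its standard deviation, and then propagates both to Heisenberg's bound, $\hbar/2 = h/(4\pi)$.

```python
import math
import re


def parse_concise(text):
    """Parse 'd.ddd(uu)E-nn' into (value, standard_deviation)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\((\d+)\)[eE]([+-]?\d+)", text.strip())
    if match is None:
        raise ValueError(f"not in concise notation: {text!r}")
    integer, fraction, unc_digits, exponent = match.groups()
    value = float(f"{integer}.{fraction}e{exponent}")
    # The digits in parentheses are aligned with the last digits of the fraction.
    deviation = float(f"{unc_digits}e{int(exponent) - len(fraction)}")
    return value, deviation


h, h_std = parse_concise("6.62606896(33)E-34")  # J s, the 2006 value quoted in the text
hbar_half = h / (4.0 * math.pi)                  # right hand side of Heisenberg's bound
hbar_half_std = h_std / (4.0 * math.pi)          # a constant factor rescales the deviation
print(h, h_std, h_std / h)
print(hbar_half, hbar_half_std)
```

Because $\hbar/2$ is a fixed multiple of $h$, the bound inherits exactly the same relative uncertainty as $h$ itself, about 5 parts in $10^8$ for the value quoted here.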



5.6.2 Schrödinger's Wave Equation 

In the last sections we have analyzed de Broglie's complementarity principle, which states 
that any moving particle has associated to itself a "pilot wave" of wavelength $\lambda = h/(mv)$. 
In section 4.2 we analyzed some of the basic properties of the classical wave equation, 
displayed below on the left hand side: 

$$ \frac{d^2 \psi}{dx^2} + \omega^2 \psi = 0 , \qquad \omega^2 = \left( \frac{2\pi m v}{h} \right)^2 = \frac{8\pi^2 m}{h^2} \left( E - V(x) \right) . $$

In the classical equation, $\omega = 2\pi/\lambda$ is the wave's angular frequency. What should a 
quantum wave equation look like? Schrödinger's idea was to replace the classical 
wavelength by de Broglie's, that is, to use $\omega = 2\pi m v / h$. Using the definition of the kinetic 
energy of a particle, $T = (1/2) m v^2$, and its relation to $V(x)$ and $E$, the particle's potential 
and total energy, $T = E - V(x)$, we find the expression for $\omega^2$ displayed above on the 
right. 

This is Schrödinger's (time independent) wave equation, which established a firm basis 
for the development of Quantum Mechanics, also known in its early days as "wave mechanics". 
One of the immediate successes of Quantum Mechanics was to provide elegant 
explanations, based on physical first principles, for many known empirical facts of chemistry, 
like the properties of the periodic table, molecular geometry, etc. Among the books 
providing accessible introductions to QM we mention: the very nice elementary text by 
Enge (1972), the concise introduction by Landshoff (1998), McGervey (1995), which focuses 
on wave mechanics, and Heitler (1956), which focuses on quantum chemistry. 

Quantum Mechanics was also the basis for the development of completely new tech- 
nologies. Among the most distinguished examples are solid-state or condensed matter 
electronic devices such as transistors, integrated circuits, lasers, liquid crystals, etc. These 
devices constitute, in turn, the basic components of modern digital computers. Finally, 
one can argue that computer based information processing tools are among the most 
revolutionary technologies introduced in human society, having had an impact in its or- 
ganization comparable only to a handful of other technologies (perhaps the steam and 
internal combustion engines, or electric power), see XX (20xx). 

Nevertheless, all this success was not for free. Quantum Mechanics required the re- 
thinking and re-interpretation of some of the most fundamental concepts of science. In 
this and the next sections we analyze the impact of Quantum Mechanics on the most 
important concept of statistical science, namely, probability. 

Although Schrödinger arrived at the appropriate functional form of a wave equation for 
Quantum Mechanics, the adequate interpretation for the wave function, $\psi$, was given only 
a few months later by Max Born. According to Born's interpretation: The probability 
density of "finding" the particle at position $x$ is proportional to the square of the wave 
function's absolute amplitude, $|\psi(x)|^2$. Since, in the general case, $\psi$ is a complex function, 
the last quantity can also be written as the product of the wave function by its complex 
conjugate, that is, $|\psi(x)|^2 = \psi^*(x) \, \psi(x)$. 
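
Born's rule can be illustrated on a discretized wave function. In the sketch below the continuous position is replaced by a few cells of a uniform grid, so that the probability of each cell is proportional to $|\psi|^2$ at that cell; the grid, the amplitudes and the function name are assumptions made only for the example.

```python
import cmath


def born_probabilities(amplitudes):
    """Probabilities proportional to |psi|^2 = conj(psi) * psi on a discrete grid."""
    weights = [(complex(a).conjugate() * complex(a)).real for a in amplitudes]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("the wave function vanishes on the whole grid")
    return [w / total for w in weights]


psi = [1 + 1j, 2j, -1]                            # complex amplitudes on three cells
probabilities = born_probabilities(psi)           # weights 2, 4, 1 before normalization
rotated = [cmath.exp(0.7j) * a for a in psi]      # same state times a global phase
print(probabilities)
print(born_probabilities(rotated))
```

Multiplying the whole wave function by a global phase factor leaves every probability unchanged, since the phase cancels in the product of $\psi$ by its complex conjugate.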

From this interpretation of the wave function, we can understand Max Born's formulation 
of 'the core metaphor of wave mechanics', as quoted in Pais (1988, ch.12, sec.d, 
p.258), 

"The essence of wave mechanics: 'The motion of particles follows probability 
laws but the probability itself propagates according to the law of causality.'" 

This is a revolutionary interpretation, that attributes to the concept of probability a 
new and distinct 'objective' character. Hence, it is interesting to have some insight on 
the genesis of Born's interpretation. Born's own recollections are presented in Pais (1988, 
ch.12, sec.d, p.258-259): 

"What made Born take his step? 

In 1954 Born was awarded the Nobel Prize 'for his fundamental research, 
specially for his statistical interpretation of the wave function'. In his acceptance 
speech Born, then in his seventies, ascribed his inspiration for the 
statistical interpretation to 'an idea of Einstein's [who] had tried to make the 
duality of particles - light-quanta or photons - and waves comprehensible by 
interpreting the square of the optical wave amplitudes as probability density 
for the occurrence of photons. This concept could at once be carried over to 
the $\psi$-function: $|\psi|^2$ ought to represent the probability density of electrons.'" 

5.6.3 Classic and Quantum Probability 

One of the favorite metaphors used by the orthodox Bayesian school describes the scientist's 
work as a game against nature, with the objective of scoring a good guess on 
"nature's true state". Implicit in this metaphor is the assumption that such a "true state 
of nature" exists and is, at least in principle, accessible. In this paradigm, omniscience 
is usually a matter of money, that is, with enough economic resources all pertinent 
information can, at least in principle, be acquired, see Blackwell and Girshick (1954), for 
example. 

"Statistics can be viewed as a game against nature." (p. 75). 

"...games where one of the players is not faced with an intelligent opponent 
but rather with an unknown state of nature." (p. 121). 

"The same theory that served to delineate optimal strategies in games 
played against an intelligent opponent will serve to delineate classes of op- 
timal strategies in games played against nature." (p. 123). 

"What prevents the statistician from getting full knowledge of u [the state 
of nature] by unlimited experimentation is the cost of experiments." (p. 78). 

This paradigm seems incompatible with, or at least very unfriendly to, Born's probabilistic 
interpretation of Quantum Mechanics and Heisenberg's uncertainty principle. We 
believe that, in the context of quantum mechanics, the strictly subjective interpretation 
of probability is, please forgive the pun, a very risky metaphor, and that pushing this 
metaphor where it does not belong will lead to endless paradoxes. In Chapter 7 of his 
book, The Physics of Chance, for example, Charles Ruhla presents the adventures of the 
simple-minded hero Monsieur de La Palice, struggling to understand some basic quantum 
experiments. 

For a strict subjectivist the situation is even worse, and the use of Quantum Mechanics 
is at risk of being considered illegal. A statement giving the current best estimate of 
$h$ (Planck's constant) together with its standard deviation was presented in section 
5.6.1. Since $h$ appears at the right hand side of Heisenberg's uncertainty principle, an 
uncertainty about the value of $h$ implies a second order uncertainty. The propagation 
of the uncertainty about the value of fundamental physical constants generates similar 
second order probabilistic statements about the detection, measurement or observation 
of quantum phenomena. For example, section 5.7.2 arrives at statements giving the 
(probabilistic) uncertainty of (probabilistic) transition rates. All these are prototypical 
examples of statements that are categorically forbidden in orthodox Bayesian statistics, 
as bombastically proclaimed in the following quotations from de Finetti (1977, p.1,5 and 
1972, p.190), see also Mosleh and Bier (1996) and Wechsler et al. (2005). 

"Does it make sense to ask what is the probability that the probability of 
a given event has a given value, $p_i$? ... It makes no sense to state that the 
probability of an event E is to be regarded as unknown in that its true value 
is one of the $p_i$'s, but we do not know which one." 

"Speaking of unknown probabilities [or of probability of a probability] must 
be forbidden as meaningless. " 

A similar statement of de Finetti was analyzed in section 4.7. Such an awkward 
position, at least for a modern physicist, was seen by the founding fathers of orthodox 
Bayesian statistics as an unavoidable consequence of the subjectivist doctrine, according 
to which, 

"Probabilities are states of mind, not of nature." Savage (1981, p. 674). 

From a constructivist perspective, fundamental physical constants, including of course 
Planck's constant, correspond to very objective (very sharp, stable, separable and com- 
posable) eigenvalues of Physics' research program, and it is perfectly admissible to speak 
about the uncertainty of their estimated values. Of course that is what physicists need to 
do, and have done for almost a century, regardless of being disapproved by the Bayesian 
orthodoxy (theoretically coherent, but understandably very shy and timid). There have 
also been some attempts to reconcile a strict subjectivist position with modern physics, 
through long and sophisticated translations of simple "crude" statements like the ones 
quoted above. Some of these translations are as bizarre and / or intricately involved as 
similar attempts to translate epistemic probabilistic statements that are categorically for- 
bidden in frequentist statistics into "acceptable" frequentist probabilistic statements, see 
section 2.5 and Rouanet et al. (1998, Preamble). Richard Feynman (2002, p. 14), makes 
the following comments on some ideas behind some of such interpretations: 

"Now, the philosophical question before us is, when we make an observation 
of our track in the past, does the result of our observation become real in 
the same sense that the final state would be defined if an outside observer 
were to make the observation? This is all very confusing, especially when we 
consider that even though we may consistently consider ourselves always to 
be the outside observer when we look at the rest of the world, the rest of the 
world is at the same time observing us, and that often we agree on what we 
see in each other. Does this mean that my observations become real only when 
I observe an observer observing something as it happens? This is an horrible 
viewpoint. Do you seriously entertain the thought that without observer there 
is no reality? Which observer? Any observer? Is a fly an observer? Is a 
star an observer? Was there no reality before $10^9$ B.C. before life began? 
Or are you the observer? Then there is no reality to the world after you are 
dead? I know a number of otherwise respectable physicists who have bought life 
insurance. By what philosophy will the universe without man be understood? 

In order to make some sense here, we must keep an open mind about the 
possibility that for sufficiently complex systems, amplitudes become probabili- 
ties...." 

In order to provide deeper insight on the meaning of Heisenberg's uncertainty princi- 
ple, let us link it to Noether's theorems, already discussed in section 2.8.1. The central 
point of Noether's theorems lies in the existence of an invariant physical quantity for each 
continuous symmetry group in a physical theory. Heisenberg's uncertainty relation, presented 
in section 5.6.1, sets a bound on the accuracy with which we can access, by means 
of physical measurements, such symmetry / invariant dual or conjugate pairs. This point 
is further analyzed by Bohr: 

"...we admire Planck's happy intuition in coining the term 'quantum of ac- 
tion' which directly indicates a renunciation of the action principle, the central 
position of which in the classical description of nature he himself has empha- 
sized on more than one occasion. This principle symbolizes, as it were, the 
peculiar reciprocal symmetry relation between the space-time description and 
the laws of conservation of energy and momentum, the great fruitfulness of 
which, already in classical physics, depends upon the fact that one may exten- 
sively apply them without following the course of the phenomena in space and 
time." (p.94 or 210). 

"Indeed, the inevitability of using, for atomic phenomena, a mode of de- 
scription which is fundamentally statistical arises from a closer investigation 
of the information which we are able to obtain by direct measurement of these 
phenomena and the meaning we may ascribe, in this connection, to the appli- 
cation of the fundamental physical concepts... 

Such considerations lead immediately to the reciprocal uncertainty relations 
set up by Heisenberg and applied by him as the basis of a thorough investigation 
of the logical consistency of quantum mechanics." (p. 113-114 or 247-248). 

In the article Space-Time Continuity and Atomic Physics, Bohr (1935, p. 370) further 
explores the relation between quantization and our use of probabilistic language: 



"With the forgoing analysis we have described the new point of view brought 
forward by the quantum theory. Sometimes one has described it as leaving 
aside the idea of causality. I think we should rather say that in the quantum 
theory we try to express some laws of nature that lie so deep that they can 
not be visualized, or, which cannot be accounted for by the usual description 
in terms of motion. This state of affairs brings about the fact that we must 
use to a great extent statistical methods and speak of nature making choices 
between possibilities. " 

The correct interpretation of probability has been one of the key conceptual prob- 
lems of modern physics. The importance of this problem can be further appreciated in 
the following statement of Paul Dirac, found in (Pais 1986, p. 255), regarding the early 
development of quantum mechanics: 

"This problem of getting the interpretation proved to be rather more difficult 
than just working out the equations. " 

The "correct" interpretation or "best" metaphysics for quantum mechanics, including 
the ontological and epistemological status of probability and the understanding of its 
role in the theory, is an area of strong academic interest and current research, see for 
example Albert (1993, ch.7) for an exposition of David Bohm's interpretation of QM. 
Richard Feynman's path integral formalism, see for example Feynman and Hibbs (1965), 
Honerkamp (1993) and Wiegel (1986), makes it possible to support other alternative 
interpretations. 

Perhaps the most important lesson to be learned from this section is that one must be 
aware of the several possible meanings and interpretations of the concept of probability, 
and that distinct situations may require or benefit from distinct approaches. In the 
best spirit of complementarity, we should even consider the possibility of studying the 
same situation under different perspectives, each one of them providing a positive and 
irreplaceable contribution to our understanding of a whole that is beyond the grasp of a 
single picture.^1 

^1 The following quote was brought to my attention by Jean-Yves Beziau: "The ordinary man has 
always been sane because the ordinary man has always been a mystic... He has always cared for truth 
more than for consistency. If he saw two truths that seemed to contradict each other, he would take the 
two truths and the contradiction along with them." Gilbert Keith Chesterton (1874 - 1936). 

5.7 Theories of Evolution 

The objective of this section is to highlight the importance of three key concepts that are 
essential to modern theories explaining the evolution of complex systems, and to follow 
some points in their development and interconnection, namely: (1) the systemic view; 
(2) modularity; and (3) stochastic evolution and/or probabilistic causation. Probabilistic 
causation is by far the most troublesome of these concepts. It is absolutely essential, at 
least in the framework presented in this chapter, to the evolution of complex systems, on 
one hand, but it was not easy for stochastic evolution to make its way as a "legitimate" 
concept in modern science, on the other. We believe that the historical progress and 
acceptance of the ontological status of these probabilistic concepts is closely related to 
the evolution of epistemological frameworks that can, in turn, strongly influence and be 
influenced by the corresponding statistical theories giving them operational support. 

5.7.1 Systemic View and Probabilistic Causation 

The systemic view has always been part of the biological thinking. The teleomechanics 
school gave particular importance to a systemic view of living organisms, see Lenoir (1989) 
for an excellent historical account. As quoted in Lenoir (1989, p. 220, 221), for example, 
the XIX century biologist C. Reichert states: 

"... 'we have a systemic product before us,... in which the intimate interconnections 
of the constituent parts have reached their highest degree. When 
we think about a system, we normally picture ourselves precisely this form of 
systematic product. Concerning such systems Kant said that the parts only ex- 
ist with reference to the whole and the whole, on the other hand, only appears 
to exist for the sake of the parts. ' 

In order to investigate the systematic character of biological organisms Re- 
ichert reminded the readers that it was necessary to have a method appropriate 
to the subject... Reichert could envision only one method to the investigation 
of the living organism which avoids disrupting the intimate interconnections 
of its parts: 

'The systematist is aware both that he proceeds genetically and that he must 
proceed genetically. He is aware that the structure on an organism consists in 
the systematic division or dissection of the germ, which receives a particular 
systematic unity through inheritance, makes it explicit through development 
and transmits it further through procreation. ' " 

These statements express one of the core methodological doctrines of the teleomechan- 
ics school, namely, that to understand the systemic character of the organism, one must 
examine its development. The systemic approach of the teleomechanics school greatly 
contributed to the study of many fields in "Biology" (a word coined within this school), 
facilitating complex analyses and multiscale interconnections. C.F.Kielmeyer, another 
great representative of the teleomechanics school, for example, linked individual and 
populational developments in his celebrated biogenic, parallelism, or recapitulation principle 
of Embryology: 

"Ontogeny recapitulates phylogeny. " 

The teleomechanics research program, however, could never overcome (perceived) incom- 
patibility conflicts among some of its basic principles, such as, for example, the conflict 
between the teleological organization of organic systems, on one hand, and the need to 
use only scientifically accepted forms of causal explanation, on the other. Consequently 
the scientists in this program found themselves struggling between deterministic reduc- 
tionist mechanisms and vitalistic explanations, both unable to offer significant scientific 
knowledge or acceptable understanding for the phenomena in study. 

According to the framework for evolution presented in this chapter, the diagnostic for 
this failure is quite obvious, namely, the lack of key conceptual probabilistic ingredients. 
This situation is analyzed in Lenoir (1989, p. 239-241): 

"Only in a universe operating according to probabilistic laws, a universe 
grounded in non-deterministic causal processes, is it possible to harmonize the 
evolution of sequences of more highly organized beings with the principles of 
mechanics. 

Two paths lay open for providing a consistent and rigorous solution to this 
dilemma. One alternative is that of twentieth century science. It is simply 
to abandon the classical notion of cause in favor of a non-deterministic conception 
of causality. In the late nineteenth century this was not an acceptable 
strategy. To be sure statistical methods were being introduced into physics with 
great success, but prior to the quantum revolution in mechanics no one was 
prepared to assert the probabilistic nature of physical causes.... 

A second solution to this dilemma is that proposed by teleomechanists. Ac- 
cording to this interpretation rigidly determined causality can be retained, but 
then limits must be placed on the analysis of the ultimate origins of biological 
organization, and certain ground states of purposive or zweckmässig organization 
must be introduced. 

In the final analysis the only resolution of their impasse was the construc- 
tion of an entirely new set of conceptual foundations for both the biological 
and the physical sciences which could cut the Gordian knot of chance and 
necessity. " 

The breakthrough of introducing stochastic dynamics in modern theories of evolution 
is perhaps the greatest merit of Charles Darwin. According to Peirce (1893, 183-184): 

"(In) The Origin of Species published toward the end of 1859... the idea 
that chance begets order, which is one of the cornerstones of modern physics... 
was at that time put into its clearest light." 

The role of probability in Darwin's theories can be best appreciated in his own words: 

"Throughout this chapter and elsewhere I have spoken of selection as the 
paramount power, yet its action absolutely depends on what we in our igno- 
rance call spontaneous or accidental variability. Let an architect be compelled 
to build an edifice with uncut stones, fallen from a precipice. The shape of each 
fragment may be called accidental; yet the shape of each has been determined 
by the force of gravity, the nature of the rock, and the slope of the precipice, 
events and circumstances, all of which depend on natural laws; but there is 
no relation between these laws and the purpose for which each fragment is used 
by the builder. In the same manner the variations of each creature are deter- 
mined by fixed and immutable laws; but these bear no relation to the living 
structure which is slowly built up through the power of selection, whether this 
be natural or artificial selection. 

If our architect succeeded in rearing a noble edifice, using the rough wedge- 
shaped fragments for the arches, the longer stones for the lintels, and so forth, 
we should admire his skill even in a higher degree than if he had used stones 
shaped for the purpose. So it is with selection, whether applied by man or by 
nature; for although variability is indispensably necessary, yet, when we look 
at some highly complex and excellently adapted organism, variability sinks to 
a quite subordinate position in importance in comparison with selection, in the 
same manner as the shape of each fragment used by our supposed architect is 
unimportant in comparison with his skill." Darwin (1887, ch.XXI, p. 236) 

In the above passage, the importance given to the systemic view, that is, to the 
living structure of the organism is evident. At the same time, randomness is added as an 
essential provider of raw materials in the evolutionary process. However, there are some 
important points of divergence between the way randomness plays a role in Darwinian 
evolution, and in contemporary theories. We highlight three of them: (1) Darwin uses 
only pseudo-randomness; (2) Genetic and somatic components of variation are not clearly 
distinguished; (3) Darwinian variations are continuous. Let us examine these three points 
more carefully: 

1- Darwin used pseudo-randomness, not essential uncertainty. S.J.Gould (p. 684) 
assesses this point as follows: 

"The Victorian age, basking in triumph of an industrial and military might 
rooted in technology and mechanical engineering, granted little conceptual space 
to random events... Darwin got into enough trouble by invoking randomness 
for sources of raw material; he wasn't about to propose stochastic causes for 
change as well!" 
As far as biological evolution is concerned, pseudo-randomness, as introduced by Darwin, 
is perfectly acceptable. The real need for the notion of "true" or objective probability, 
as in Quantum Mechanics, was still a few decades in the future. 

2- Darwin didn't have a clear distinction between somatic versus genetic, or external 
versus internal, causes of variations. Winther (2000, p. 425) makes the following comments: 

"Darwin's ideas on variation, heredity, and development differ significantly 
from twentieth-century views. First, Darwin held that environmental 
changes, acting on the reproductive organs or the body, were necessary to generate 
variation. Second, heredity was a developmental, not a transmissional 
process..." 

At the time of Darwin, the available technology could not, of course, reveal the biochemical 
mechanisms of heredity. Nevertheless, scientists like Hugo de Vries and Erwin 
Schrödinger had powerful insights into these mechanisms, even before the necessary 
technology became available. De Vries (1900), for example, advanced the following 
hypotheses: 

"1. Protoplasm is made up of numerous small units, which are bearers of 
the hereditary characters. 2. These units are to be regarded as identical with 
molecules." 

In his book What is Life?, Schrödinger (1945) advanced more detailed hypotheses about 
the genetic coding mechanisms, based on far reaching theoretical insights provided by 
quantum mechanics. This small book was a declared source of inspiration for both James 
Watson and Francis Crick, who, in 1953, discovered the double-helix molecular struc- 
ture of DNA, opening the possibility of deciphering the genetic code and its expression 
mechanisms. 

3- Continuous variations. From several passages of Darwin's works, it is clear that he 
saw actual variations as coming from a continuum of potential possibilities: 

"[as] I have attempted to show in my work on variation... they [are] ex- 
tremely slight and gradual." Darwin (1959, p. 86). 

"On the slow and successive appearence of new species: . . . organic beings 

accord best with the common view of the immutability of species, or with that of 
their slow and gradual modification, through variation and natural selection. " 
Darwin (1959, p.l67). 

"It is indeed manifest that multitudes of species are related in the closest 
manner to other species that still exist, or have lately existed; and it will 
hardly be maintained that such species have been developed in an abrupt or 
sudden manner. Nor should it be forgotten, when we look to the special parts
of allied species, instead of to distinct species, that numerous and wonderfully
fine gradations can be traced, connecting together widely different structures. "
Darwin (1959, p.117).

The first modern reference for discrete or modular genetic variations can be found 
in the work of Gregor Mendel (1865), see next paragraph. It was unfortunate that the 
ideas of Mendel, working at a secluded monastery in Brünn (Brno), were not immediately
appreciated. For a contemporary view of evolution and modularity, see Margulis (1999) 
and Margulis and Sagan (2003). 

"The Forms of the Hybrids: 

With some characters. . . one of the two parental characters is so prepon- 
derant that it is difficult, or quite impossible, to detect the other in the hybrid. 

This is precisely the case with the Pea hybrids. In the case of each of the 
7 crosses the hybrid-character resembles that of one of the parental forms so
closely that the other either escapes observation completely or cannot be de- 
tected with certainty. This circumstance is of great importance in the determi- 
nation and classification of the forms under which the offspring of the hybrids 
appear. Henceforth in this paper those characters which are transmitted entire, 
or almost unchanged in the hybridization, and therefore in themselves consti- 
tute the characters of the hybrid, are termed the dominant, and those which 
become latent in the process recessive. The expression "recessive" has been 
chosen because the characters thereby designated withdraw or entirely disap- 
pear in the hybrids, but nevertheless reappear unchanged in their progeny, as 
will be demonstrated later on. " 

The third point of divergence, the discreteness of variations, is, of course, closely linked
with the second, the nature of genetic coding. However, its implications are much deeper, as
examined in the next section.



5.7.2 Modularity Requires Quantization 

The ideas of Herbert Simon about modularity, examined in section 3.2, seem to receive
empirical support from anywhere we look in the biological world. Ksenzhek and Volkov
(1998, p. 80), also quoted in Souza and Manzatto (2000), for instance, give the following
example from Botany:

"A plant is a complicated, multilevel, hierarchical system, which provides a 
very high degree of integration, beginning from the elementary process of catching
light quanta and ultimately resulting in the functioning of a macroscopic
plant as an entire organism. The hierarchical structure of plants may be examined
in a variety of aspects. (The following) table shows seven hierarchical
levels of mass and energy. "

Table 3. Plant Energetics Hierarchical and Modular Structure.

As an example of how to interpret this table, we give further details concerning its
first line: in a thylakoid membrane, about 300 chlorophyll molecules act like an antenna
in a reaction center or photosynthetic unit, capable of absorbing light quanta at a rate of
about 1K cycles/second. This energy conversion cycle absorbs photons of about 1.8eV
(430 THz or 700nm), synthesizing compounds, carbohydrates and oxygen, at an energy
level of about 1.2eV higher than its input compounds, carbon dioxide and water.
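
The figures quoted for the absorbed photons can be cross-checked with the relations
E = h f and f = c / wavelength. The following is only a minimal sketch of that
arithmetic; the function name and the constant names are ours, and the constants are
standard CODATA values rather than numbers taken from Ksenzhek and Volkov.

```python
# Physical constants in SI units (standard CODATA values, not from the quoted table).
PLANCK_H = 6.62607015e-34       # joule-second
LIGHT_C = 2.99792458e8          # meter / second
EV_IN_JOULE = 1.602176634e-19   # joule per electron-volt


def photon_from_wavelength(wavelength_m):
    """Return (frequency in Hz, energy in eV) of a photon of the given wavelength."""
    frequency_hz = LIGHT_C / wavelength_m
    energy_ev = PLANCK_H * frequency_hz / EV_IN_JOULE
    return frequency_hz, energy_ev


# Red light at 700 nm, the wavelength mentioned in the text.
freq, energy = photon_from_wavelength(700e-9)
print(f"{freq / 1e12:.0f} THz, {energy:.2f} eV")
```

For 700nm this gives a frequency of a few hundred terahertz and an energy a little
below 1.8eV, in agreement with the rounded values quoted above.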

Ksenzhek and Volkov (1998, p. 80), see next quotation, also make an important remark
concerning the need for a specific and non-reductionist interpretation of each line in the 
above table, or structural level in the organism. For related aspects in Biology, see Buss 
(2007). Niels Bohr (1987b, Light and Life, p.3-12; Biology and Atomic Physics, p.13-22) 
presents a similar argument based on the general concept of complementarity. 

"It should be noted that any hierarchical level that is above another level 
cannot be considered as the simple sum of the elements belonging to that lower 
level. In all cases, each step from a given level of the hierarchical staircase to 
the next one is followed by the development of new features not inherent in the 
elements of the lower level. " 

Table 2 stops at somewhat arbitrary levels and could be extended further up or down. 
Higher levels in the table would enter the domains of Ecology. Lower levels would pene- 
trate the domains of Chemistry, and then Physics. At this point, we make an astonish- 
ing observation: Classical Physics cannot accommodate stable atomic models. Classical 
Physics gives no support for discreteness or modularity of any kind. Hence, our modular 
view of the world would be, within classical Physics, a giant with feet of clay! Werner 
Heisenberg (1958, p. 5, 6) describes the situation as follows: 

"In 1911 Rutherford's observations... resulted in his famous atomic model. 
The atom is pictured as consisting of a nucleus, which is positively charged and
contains nearly the total mass of the atom, and electrons, which circle around 
the nucleus like planets circle around the sun. The chemical bond between 
atoms of different elements is explained as an interaction between the outer 
electrons of the neighboring atoms; it has not directly to do with the atomic 
nucleus. The nucleus determines the chemical behavior of the atom through its 
charge which in turn fixes the number of electrons in the neutral atom. Initially 
this model of the atom could not explain the most characteristic feature of 
the atom, its enormous stability. No planetary system following the laws of 
Newton's mechanics would ever go back to its original configuration after a 
collision with another such system. But an atom of the element carbon, for 
instance, will still remain a carbon atom after any collision or interaction in 
chemical binding. 

The explanation of this unusual stability was given by Bohr in 1913, through 
the application of Planck's quantum hypothesis. An atom can change its energy 
only by discrete energy quanta, this must mean that the atom can exist only in 
discrete stationary states, the lowest of which is the normal state of the atom. 
Therefore, after any kind of interaction, the atom will finally always fall back 
into its normal state. " 




Figure 12: Orbital Eigensolutions for the Hydrogen Atom. 
Figure 13: Orbital Transitions for Hydrogen Spectral Lines; 
Series: Lyman, n = 1; Balmer, n = 2; Paschen, n = 3; m = n + 1, ..., ∞.



Bohr's model is based on the quantization of the angular momentum of the electron
in the planetary atomic model. The wave-particle duality metaphor can give us a simple
visualization of Bohr's model. As already mentioned in section 4.2, a string of length L
with two fixed ends can only accommodate (stationary) waves for which the string length
is a multiple of the (half) wavelength, i.e. L = nλ, n = 1, 2, 3, .... The first one (n = 1,
longest wavelength, lowest frequency) is called the fundamental frequency of the string, and
the others (n = 2, 3, ..., shorter wavelengths, higher frequencies) are called its harmonics.

Putting together de Broglie's duality principle and the planetary atomic model, we
can think of the electron's orbit as a circular string of length L = 2πr. Plugging in de
Broglie's equation, λ = h/mv, and imposing the condition of having stable eigenfunctions
or standing waves, see Enge (1972) and Figure 12, we have

$$ 2\pi r = n\lambda = n \frac{h}{mv} \quad \Rightarrow \quad n \frac{h}{2\pi} = m v r . $$

Planck's constant equals 6.626E-34 joule-seconds or 4.136E-15 electron-volt-seconds,
and the electron mass is 9.11E-28 gram. Since the right hand side of this equation is
the angular momentum of the orbiting electron, de Broglie's wave-particle duality principle
imposes its quantization.
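
The standing wave condition can be checked numerically. The sketch below is only an
illustration: the ground state radius and orbital speed of the electron are assumed
textbook values that do not appear in the text, as is the Bohr model scaling of the
n-th orbit (radius proportional to n squared, speed proportional to 1/n). The function
names are ours.

```python
import math

PLANCK_H = 6.62607015e-34        # joule-second
ELECTRON_MASS = 9.1093837e-31    # kilogram (9.11E-28 gram, as quoted in the text)

# Assumed textbook values for the hydrogen ground state (not given in the text).
BOHR_RADIUS = 5.29177e-11        # meter
GROUND_SPEED = 2.18769e6         # meter / second


def standing_wave_number(radius_m, speed_m_s, mass_kg=ELECTRON_MASS):
    """Number of de Broglie wavelengths that fit along a circular orbit."""
    wavelength = PLANCK_H / (mass_kg * speed_m_s)    # lambda = h / (m v)
    return 2.0 * math.pi * radius_m / wavelength     # n = 2 pi r / lambda


def angular_momentum_in_hbar(radius_m, speed_m_s, mass_kg=ELECTRON_MASS):
    """Angular momentum m v r expressed in units of h / (2 pi)."""
    return mass_kg * speed_m_s * radius_m / (PLANCK_H / (2.0 * math.pi))


# Ground state orbit, and second orbit under the assumed Bohr scaling.
n_ground = standing_wave_number(BOHR_RADIUS, GROUND_SPEED)
n_second = standing_wave_number(4 * BOHR_RADIUS, GROUND_SPEED / 2)
print(round(n_ground, 3), round(n_second, 3))
```

Under these assumptions the two orbits accommodate, up to rounding of the input
values, one and two full wavelengths, and the angular momentum in units of h/2π takes
the same integer values.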

Bohr's atomic model was also able to, for the first time, provide an explanation for 
another intriguing phenomenon, namely: 

(a) Atoms only emit light at sharply defined frequencies, known as spectral lines; 

(b) The frequencies, ν, or wavelengths, λ, of these spectral lines are related by integer
algebraic expressions, like the Balmer-Rydberg-Ritz-Paschen empirical formula,

$$ \frac{1}{\lambda} = R \left( \frac{1}{n^2} - \frac{1}{m^2} \right) , $$

where R = 1.0973731568525(73)E7 m^-1 is Rydberg's constant.

Distinct combinations of integer numbers, 0 < n < m, in the BRRP formula give distinct
wavelengths of the spectrum, see Enge (1972). It so happens that these frequencies are
in precise correspondence with the differences of energy levels of orbital eigen-solutions,
see Figure 13. These are the Hydrogen spectral series of Lyman, n = 1, Balmer, n = 2,
Paschen, n = 3, and Brackett, n = 4, for m = n + 1, ..., ∞. Similar spectral series have
been known for other elements, and used by chemists and astronomers to identify the
composition of matter from the light it radiates. Rydberg's constant can be written as
$R = m_e e^4 / (8 \epsilon_0^2 h^3 c)$, where $m_e$ is the rest mass of the electron, e is
the elementary charge, $\epsilon_0$ is the permittivity of free space, h is Planck's
constant, and c is the speed of light in vacuum.
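
As an illustration, the expression for Rydberg's constant and the spectral series can
be evaluated directly. This is a minimal sketch: the function and variable names are
ours, the constants are standard CODATA values, and the wavelengths are the idealized
ones for an infinitely heavy nucleus in vacuum.

```python
# CODATA values in SI units.
ELECTRON_MASS = 9.1093837015e-31        # kilogram
ELEMENTARY_CHARGE = 1.602176634e-19     # coulomb
VACUUM_PERMITTIVITY = 8.8541878128e-12  # farad / meter
PLANCK_H = 6.62607015e-34               # joule-second
LIGHT_C = 2.99792458e8                  # meter / second


def rydberg_constant():
    """R = m_e e^4 / (8 eps0^2 h^3 c), in 1/meter."""
    return (ELECTRON_MASS * ELEMENTARY_CHARGE**4
            / (8.0 * VACUUM_PERMITTIVITY**2 * PLANCK_H**3 * LIGHT_C))


def line_wavelength_nm(n, m, rydberg=None):
    """Wavelength in nm of the line 0 < n < m, from 1/lambda = R (1/n^2 - 1/m^2)."""
    if not 0 < n < m:
        raise ValueError("need integers with 0 < n < m")
    r = rydberg_constant() if rydberg is None else rydberg
    return 1e9 / (r * (1.0 / n**2 - 1.0 / m**2))


# First three lines, m = n + 1, n + 2, n + 3, of each named series.
SERIES = {"Lyman": 1, "Balmer": 2, "Paschen": 3, "Brackett": 4}
for name, n in SERIES.items():
    first_lines = [round(line_wavelength_nm(n, m), 1) for m in range(n + 1, n + 4)]
    print(f"{name:9s} n={n}: {first_lines} nm")
```

The first Balmer line (n = 2, m = 3) falls near 656nm, in the red part of the visible
spectrum, while the Lyman series lies in the ultraviolet and the Paschen and Brackett
series in the infrared.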

The importance attributed by Bohr to the emergence of these sharp (discrete) eigen- 
solutions out of a higher dimensional continuum of possibilities is emphasized in Bohr 
Wavelengths of a complete series of spectral lines in the hydrogen spectrum
can be expressed with the aid of integers. This information, he [Bohr] said, 
left an indelible impression on him. " 

Approximation methods of perturbation theory can be used to compute probabilities
of spontaneous and induced transitions between the different orbitals or energy states
of an atom, and these transition rates can be observed as intensities of the respective
spectral lines, see Enge (1972, ch.8), Landshoff (1998, ch.7) and McGervey (1995, ch.14).
Comparative analyses between the value and accuracy of these theoretical calculations
and empirical observations are of obvious interest. However, the natural interpretation
of these analyses immediately generates statements about the uncertainty of transition
rates, expressed as probabilities of probabilities. Hence, as explained in section 5.6.3,
these statements collide with the canons of the subjectivist epistemological framework
and are therefore unacceptable in orthodox Bayesian statistics.



5.7.3 Quantization Entails Objective Probability 

An objective form of probability is at the core of quantum mechanics theory, as seen 
in previous sections. However, probabilistic explanations or probabilistic causation have 
been, at least from a historical perspective, very controversial concepts. This has been 
so since the earliest times. Aristotle (Physics, II,4,195b-196a) discusses events resulting 
from coincidences or incidental circumstances. If such an event serves a conscious human 
purpose, it is called τύχη, translated as luck or fortune. If it serves the "unconscious
purposiveness of nature", it is called αὐτόματον, translated as chance or accident.

"We must inquire therefore in what manner luck and chance are present 
among the causes enumerated, and whether they are the same or different, and 
generally what luck and chance are. 

Thus, we must inquire what luck and chance are, whether they are the same 
or different, and how they fit into our division of causes. 

Some people even question whether they are real or not. They say that
nothing happens by chance, but that everything which we ascribe to luck or
chance has some definite cause.

Others there are who, indeed, believe that chance is a cause, but that it is 
inscrutable to human intelligence, as being a divine thing and full of mystery. " 

Aristotle (Physics, II,4,195b-196a) also reports some older philosophical traditions 
that made positive use of probabilistic causation, such as a stochastic development or 
evolution theory due to Empedocles: 

"Wherever then all the parts came about just what they would have been 
if they had come to be for an end, such things survived, being organized spontaneously
in a fitting way; whereas those which grew otherwise perished and
continue to perish, as Empedocles says. . . "

Many other ancient cultures accepted probabilistic arguments and/or did make use 
of randomized procedures, see Davis (1969), Kaptchuk and Kerr (2004) and Rabinovitch 
(1973). Even the biblical narrative, so averse to magic of any sort, presents the idea 
that destiny is ultimately inscrutable to human understanding, see for example Exodus 
(XXXIII, 18-23): 

Moses, who is always willing to speak his mind, asks God for perfect knowledge: 

And Moses said: I pray You, show me Your glory! 

In response God, Who is always ready to explain to Moses Who makes the rules, tells
him that perfect knowledge cannot be achieved by a living creature. This verse may also
allegorically indicate that temporal irreversibility is a necessary consequence of such a
veil of uncertainty:

And the Lord said: You cannot see My face, for no man can see Me and live!... 
I will enclose and confine you, and protect you in My manner... (so that) 
You shall see My back, but My face shall not be seen. 

Nevertheless, the concepts of stochastic evolution and probabilistic causation lost
prestige along the centuries. From the comments of Gould and Lenoir in section 7.1, we may
conclude that in the XVIII and early XIX centuries their status reached the lowest level
ever. It is ironic, then, that stochastic evolution is the concept at the eye of the storm
of some of the most important scientific revolutions of the late XIX and XX centuries.

As seen in section 6, Quantum Mechanics entails Heisenberg's uncertainty principle, 
stating that we cannot measure (in practice or in theory) the classical variables describing
the motion of a particle with a precision beyond a hard threshold given by Planck's con- 
stant. Hence, the available information about a physical system is, in quantum mechanics, 
governed by laws that are in nature essentially probabilistic, or, as stated in Ruhla (1992, 
p.162), 

"No longer is it chance as a matter of ignorance or of incompetence: it is 
chance quintessential and unavoidable. " 

The path leading to an essentially stochastic world-view was first foreseen by people
far ahead of their time, like C.S.Peirce and L.Boltzmann, a path that was then advanced
by reluctant revolutionaries like M.Planck, A.Einstein, and E.Schrödinger, who had
a major participation in forging the new concept of probability, but who were, at the
same time, still emotionally attached to classical concepts. Finally, a third generation,
including N.Bohr, W.Heisenberg and M.Born, fully embraced the new concept of objective
probability. Of course, as with all truly innovative concepts, it will take mankind at least
a few generations to truly assimilate and incorporate the new idea.

5.8 Final Remarks 

The "objectification of probability" and the consequent rise of the ontological status of
stochastic evolution and/or probabilistic causation was arguably one of the two greatest 
innovations of modern physics. The other great innovation is the "geometrization of 
space-time" in Einstein's theories of special and general relativity, see French (1968) and 
Martin (1988) for intuitive introductions, Sachs and Wu (1977) for a rigorous treatment, 
and Misner et al. (1973) for an encyclopedic treatise. 

The manifestation of physical quantization and (special) relativistic geometry is
regulated by Planck's constant and the speed of light. The values of these constants in
standard (international) metric units, h = 6.6E-34 Js and c = 3.0E+8 m/s, have, respectively,
a tiny and a huge order of magnitude, making it easy to understand why most of the effects of
modern physics are not immediately perceptible in our ordinary life experience and,
therefore, why classical physics can offer acceptable approximations in many circumstances of
common statistical practice. However, modern physics has forever changed some of our
most basic concepts related to space, time, causality and probability. Moreover, we have
seen in this chapter how some of these concepts, like modularity and probabilistic
causation, are essential to our theories and to the understanding of phenomena in many other
fields. We have also seen how quantization or stochastic evolution have a direct or indirect
bearing on areas much closer to our daily life, like Biology and Engineering. Hence, it is of
vital importance to incorporate these new concepts into a contemporary epistemology or, at
least, to use an epistemological framework that is not incompatible with these new ideas.



Chapter 6 

The Living and Intelligent Universe



"Cybernetics is the science of defensible metaphors. " 

Gordon Pask (1928-1996). 

"You, with all these words...." 
Marisa Bassi Stern (my wife, when I speak too much).

"Yes I think to myself: What a wonderful world!" 
B.Thiele and G.D.Weiss, in the voice of L.Armstrong. 



In the article Mirror Neurons, Mirror Houses, and the Algebraic Structure of the Self, 
by Ben Goertzel, Onar Aam, F. Tony Smith and Kent Palmer (2008) and the companion 
article of Goertzel (2007), the authors provide an intuitive explanation for the logic of 
mirror houses, that is, they study symmetry conditions for specular systems entailing the 
generation of kaleidoscopic images. In these articles, the authors share (in my opinion)
several important insights on autopoietic systems and constructivist philosophy. A more
prosaic kind of mirror house used to be a popular attraction in funfairs and amusement
parks. The entertainment then came from misperceptions about oneself or other objects.
More precisely, it came from the misleading ways in which a subject sees how or where the
objects inside the mirror house are, or how or where he himself stands in relation to other
objects.

The main objective of this chapter is to show how similar misperceptions in science
can lead to ill-posed problems, paradoxical situations and even misconceived philosophical
dilemmas. The epistemological framework of this discussion will be that of cognitive
constructivism, as presented in previous chapters. In this framework, objects within a
scientific theory are tokens for eigen-solutions, which Heinz von Foerster characterized by
four essential attributes, namely those of being discrete (precise, sharp or exact), stable,
separable and composable. The Full Bayesian Significance Test (FBST) is a possibilistic
belief calculus based on a (posterior) probabilistic measure, originally conceived as a
statistical significance test to assess the objectivity of such eigen-solutions, that is, to
measure how well a given object manifests or conforms to von Foerster's four essential
attributes.

The FBST belief or credal value of hypothesis H given the observed data X is the
e-value, ev(H|X), interpreted as the epistemic value of hypothesis H (given X), or the
evidence value of data X (supporting H). A formal definition of the FBST and several of
its implementations for specific problems can be found in the author's previous articles,
and are summarized in appendix A. From now on, we will refer to Cognitive Constructivism
accompanied by Bayesian statistical theory and its tool boxes, as laid down in the
aforementioned articles, as the Cog-Con epistemological framework.
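
To give a concrete feeling for what an e-value computes, the following sketch estimates
ev(H|X) by Monte Carlo for a sharp hypothesis about a binomial proportion. It is only an
illustration under stated assumptions and not the implementation of appendix A: we
assume the usual FBST construction, in which the evidence against H is the posterior
mass of the set where the surprise function exceeds its supremum over H, and we further
assume a uniform prior and a flat reference density, so that the surprise function
coincides with the Beta posterior density. All function names are ours.

```python
import math
import random


def beta_log_pdf(theta, a, b):
    """Log density of a Beta(a, b) distribution at theta in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)


def e_value_sharp_binomial(successes, trials, theta0, draws=20000, seed=0):
    """Monte Carlo e-value for the sharp hypothesis H: theta = theta0.

    Assumptions of this sketch: uniform prior and flat reference density, so the
    surprise function is the Beta(successes + 1, failures + 1) posterior density.
    """
    a, b = successes + 1.0, (trials - successes) + 1.0
    log_s_star = beta_log_pdf(theta0, a, b)    # supremum of the surprise over H
    rng = random.Random(seed)
    in_tangent_set = 0
    for _ in range(draws):
        theta = rng.betavariate(a, b)          # draw from the posterior
        # Skip draws that round to the boundary, where the log density is undefined.
        if 0.0 < theta < 1.0 and beta_log_pdf(theta, a, b) > log_s_star:
            in_tangent_set += 1
    # Evidence for H is one minus the posterior mass of the tangential set.
    return 1.0 - in_tangent_set / draws


ev_balanced = e_value_sharp_binomial(5, 10, 0.5)   # data centered on H
ev_extreme = e_value_sharp_binomial(10, 10, 0.5)   # data far from H
print(ev_balanced, ev_extreme)
```

With 5 successes in 10 trials the hypothesis sits at the posterior mode and the e-value
is close to 1, while with 10 successes in 10 trials almost all posterior mass has higher
density than the hypothesis and the e-value is close to 0.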

Instead of reviewing the formal definitions of the essential attributes of eigen-solutions,
we analyze the Origami example, a didactic case presented by Richard Dawkins.
This is done in section 1. The origami example is so simple that it may look trivial and,
in some sense, it is so. In subsequent sections we analyze in which ways the eigen-solutions 
found in the practice of science can be characterized as non-trivial, and also highlight some 
(in my view) common misconceptions about the nature of these non-trivial objects, just 
like distinct forms of illusion in a mirror-house. 

In section 2 we contrast the control, precision and stability of morphogenic folding
processes in autopoietic and allopoietic systems. In section 3 we concentrate on object
orientation and code reuse, inter-modular adaptation and resonance, and also analyze the
yoyo diagnostic problem. In section 4 we explore auto-catalytic and hypercyclic networks,
as well as some related bootstrapping paradoxes. This section is heavily influenced by the
work of Manfred Eigen. Section 5 focuses on explanations of specific components, single
links or partial chains in long cyclic networks, including the meaning of some forms of
directional (such as upward or downward) causation. In section 6 we study the emergence
of asymptotic eigen-solutions such as thermodynamic variables or market prices, and
in section 7 we analyze the ontological status of such entities. In section 8 we study
the limitations in the role and scope of conceptual distinctions used in science, and the
importance of probabilistic causation as a mechanism to overcome, in a constructive way,
some of the resulting dilemmas. In short, sections 2 to 8 discuss autopoiesis, modularity,
hypercycles, emergence, and probability as sources of complexity and forms of non-trivial
organization. Our final remarks are presented in section 9.

In this chapter we have made a conscious effort to use examples that can be easily
visualized in space and time scales directly perceptible to our senses, or at least as close
as possible to them. We have also presented our arguments using, whenever possible, very
simple (high school level) mathematics. We did so in order to make the examples intuitive
and easy to understand, so that we could concentrate our attention on the epistemological
aspects and difficulties of the problems at hand. Several interesting figures and images
that illustrate some of the concepts discussed in this chapter are contained in the website
www.ime.usp.br/~jstern/pub/gallery2.pdf/ .

6.1 The Origami Example 

The Origami example, from the following text in Blackmore (1999, p.x-xii, emphases
are ours), was given by Richard Dawkins to present the notion of reliable replication
mechanisms in the context of evolutionary systems. Dawkins' example contrasts two
versions of the Chinese Whispers game using distinct copy mechanisms.

Suppose we assemble a line of children. A picture, say, a Chinese junk, is 
shown to the first child, who is asked to draw it. The drawing, but not the 
original picture, is then shown to the second child, who is asked to make her 
own drawing of it. The second child's drawing is shown to the third child, 
who draws it again, and so the series proceeds until the twentieth child, whose 
drawing is revealed to everyone and compared with the first. Without even 
doing the experiment, we know what the result will be. The twentieth drawing 
will be so unlike the first as to be unrecognizable. Presumably, if we lay the 
drawings out in order, we shall note some resemblance between each one and 
its immediate predecessor and successor, but the mutation rate will be so high 
as to destroy all semblance after a few generations. A trend will be visible as 
we walk from one end of the series of drawings to the other, and the direction 
of the trend will be degeneration... 

High fidelity is not necessarily synonymous with digital. Suppose we set up 
our Chinese Whispers Chinese Junk game again, but this time with a crucial 
difference. Instead of asking the first child to copy a drawing of the junk, 
we teach her, by demonstration, to make an origami model of a junk. When 
she has mastered the skill, and made her own junk, the first child is asked 
to turn around to the second child and teach him how to make one. So the 
skill passes down the line to the twentieth child. What will be the result of 
this experiment? What will the twentieth child produce, and what shall we 
observe if we lay the twenty efforts out in order along the ground? ... 

In several of the experiments, a child somewhere along the line will forget 
some crucial step in the skill taught him by the previous child, and the line of 
phenotypes will suffer an abrupt macromutation which will presumably then 
be copied to the end of the line, or until another discrete mistake is made. The
end result of such mutated lines will not bear any resemblance to a Chinese 
junk at all. But in a good number of experiments the skill will correctly pass 
all along the line, and the twentieth junk will be no worse and no better, on 
average, than the first junk. If we then lay the twenty junks out in order,
some will be more perfect than others, but imperfections will not be copied 
on down the line... 

Here are the first five instructions... for making a Chinese junk: 

1. Take a square sheet of paper and fold all four corners exactly into the 
middle. 

2. Take the reduced square so formed, and fold one side into the middle. 

3. Fold the opposite side into the middle, symmetrically. 

4. In the same way, take the rectangle so formed, and fold its two ends 
into the middle. 

5. Take the small square so formed, and fold it backwards, exactly along 
the straight line where your last two folds met...

These instructions, though I would not wish to call them digital, are po- 
tentially of very high fidelity, just as if they were digital. This is because they 
all make reference to idealized tasks like 'fold the four corners exactly into the
middle'... The instructions are self-normalizing. The code is error-correcting... 

Dawkins recognizes that instructions for constructing an origami have remarkable
properties, providing for the long term survival of the subjacent meme, i.e. specific model
or single idea, expressed as an origami. Nevertheless, Dawkins is not sure how he "wishes to
call" these properties (digital? high fidelity?). What adjectives should we use to
appropriately describe the desirable characteristics that Dawkins perceives in these
instructions? I claim that von Foerster's four essential attributes of eigen-solutions offer
an accurate description of the properties relevant to the process under study.

The instructions and the corresponding (instructed) operations are precise, stable, 
separable and composable. A simple interpretation of the meaning of these four attributes 
in the origami example is the following: 

Precision: An instruction like "fold a paper joining two opposite corners of the square" 
implies that the folding must be done along a diagonal of the square. A diagonal is a 
specific line, a 1-dimensional object in the 2-dimensional sheet of paper. In this sense the 
instruction is precise or exact. 

Stability: By interactively adjusting and correcting the position of the paper (before 
making a crease) it is easy to come very close to what the instruction specifies. Even if the 
resulting fold is not absolutely perfect (in practice it actually never is), it will probably 
still work as intended. 
Composability and Separability: We can compose or superpose multiple creases in the
same sheet of paper. Moreover, adding a new crease will not change or destroy the existing
ones. Hence, we can fold them one at a time, that is, separately.
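
The contrast between the two versions of the game, and the role played by precision
and stability, can be illustrated by a toy simulation. This is a sketch of ours, not
part of Dawkins' text: a fold is reduced to a single number, its position along a side
of unit length, each child makes an independent Gaussian error, and the noise level is
a hypothetical value chosen only for illustration.

```python
import random

IDEAL_FOLD = 0.5    # "exactly into the middle" of a side of unit length
HAND_NOISE = 0.05   # hypothetical standard deviation of one child's error


def copy_the_drawing(generations, rng):
    """Each child imitates the previous child's result, errors included."""
    position = IDEAL_FOLD
    for _ in range(generations):
        position += rng.gauss(0.0, HAND_NOISE)
    return position


def follow_the_instruction(generations, rng):
    """Each child aims at the idealized target named by the instruction."""
    position = IDEAL_FOLD
    for _ in range(generations):
        position = IDEAL_FOLD + rng.gauss(0.0, HAND_NOISE)
    return position


def mean_final_error(copy_rule, generations=20, trials=2000, seed=1):
    """Average distance from the ideal fold at the end of a line of children."""
    rng = random.Random(seed)
    total = sum(abs(copy_rule(generations, rng) - IDEAL_FOLD) for _ in range(trials))
    return total / trials


drift_error = mean_final_error(copy_the_drawing)
normalized_error = mean_final_error(follow_the_instruction)
print(f"drawing chain: {drift_error:.3f}   origami chain: {normalized_error:.3f}")
```

In the drawing chain the individual errors accumulate like a random walk, so the
expected deviation grows with the length of the line. In the origami chain each
imperfection stays with the child who made it, which is the sense in which the
instructions are self-normalizing.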

These four essential attributes are of fundamental importance in order to understand 
scientific activity in the Cog-Con framework. Moreover, Dawkins' origami example illus- 
trates these attributes with striking clarity and simplicity. 

In the following sections we will examine other examples, which are less simple, not 
so clear or non-trivial in a distinct and characteristic way. We will also draw atten- 
tion to some confusions and mistakes often made when analyzing systems with similar 
characteristics. 

6.2 Autopoietic Control, Precision, Stability 

The origami folding is performed and controlled by an external agent, the person folding
the paper. In contrast, organic development processes are self-organized. These processes
are not driven by an external agent, do not require external supervision, and usually are
not even amenable to external corrections. While artifacts and machines manufactured
like an origami are called allopoietic, from αλλο-ποίησις - external production, living
organisms are called autopoietic, from αυτο-ποίησις - self production.

Autopoiesis is a non-trivial process, in many interesting ways. For example, the
absence of external supervision or correction mechanisms requires an autopoietic process
to be stable. Moreover, typical biological processes occur in environments with high
levels of noise and have large (extra) variability. Hence the process must be intrinsically
self-correcting and redundant so that its noisy implementation does not compromise the
viability of the final product.

6.2.1 Organic Morphogenesis: (Un)Folding Symmetries 

In this section we make some considerations about morphogenic biological processes, 
namely, we study examples of tissue folding in early embryonic development. This process 
naturally invites not only strong analogies, but also sharp contrasts with the origami 
example. At a macroscopic (supra cellular) level, the organisms' organs and structures 
are built by tissue movements, as described in Forgacs and Newman (2005, p. 109), and 
Saltzman (2004, p.38). 

The main types of tissue movements in morphogenic processes are:

- Epiboly: spreading of a sheet of cells over deeper layers.

- Emboly: inward movement of cells, which is of various types, such as:

    - Invagination: infolding or insinking of a layer,

    - Involution: inturning, inside rotation or inward movement of a tissue.

- Delamination: splitting of a tissue into 2 or more parallel layers.

- Convergent/Divergent Extension: stretching together/apart of two distinct tissues.

The blastula is an early stage in the embryonic development of most animals. It 
is produced by cleavage of a fertilized ovum and consists of a hollow sphere of around 
128 cells surrounding a central cavity. From this point on, morphogenesis unfolds by
successive tissue movements. The very first of such moves is known as gastrulation, a deep 
invagination producing a tube, the archenteron or primitive digestive tract. This tube 
may extend all the way to the pole opposing the invagination point producing a second 
opening. The opening(s) of the archenteron become mouth and anus of the developing 
embryo. 

Gastrulation produces three distinct (germ) layers, that will further differentiate into 
several body tissues. Ectoderm, the exterior layer, will further differentiate into skin 
and nervous systems. Endoderm, the innermost layer at the archenteron, generates the 
digestive system. Mesoderm, between the ectoderm and endoderm, differentiates into 
muscles, connective tissues, skeleton, kidneys, circulatory and reproductive organs. We 
will use this example to highlight some important topics, some of which will be explored 
more thoroughly in further sections. 

Discrete vs. Exact or Precise Symmetries 

Notice that origami instructions, that implicitly rely on the symmetries characterizing 
the shape of the paper, require foldings at sharp edges or creases. Hence, a profile of the
folded paper sheet may look like it breaks (is non-differentiable) at a discrete or singular 
point. 

Organic tissue foldings have no sharp edges. Nevertheless, the (idealized) symmetries
of the folded tissues, like the spherical symmetry of the blastula, or the cylindrical
symmetry of the gastrula, can be described by equations just as exact or precise, see
Beloussov (2008), Nagpal (2002), Odel et al. (1980), Tarasov (1986), and Weliky and Oster
(1990). This is why we usually prefer the adjectives precise or exact to the adjective
discrete used by von Foerster in his original definition of the four essential properties of
an eigen-solution.



Centralized vs. Decentralized Control 

In morphogenesis, there is no agent acting like a central controller, dispatching messages 
telling every cell what to do. Quite the opposite, the complex forms and tissue movements 
at a global or macroscopic (supra cellular) scale are the result of collective cellular 
behavior patterns based on distributed control. The control mechanisms rely on simple 
local interaction between neighboring cells, see Keller et al. (2003), Koehl (1990), and 
Newman and Comper (1990). Some aspects of this process are further analyzed in sections 
3 and 6. 



6.3 Object Orientation and Code Reuse 

At the microscopic level, cells at the several organic tissues studied in the last section are 
differentiated by distinct metabolic reaction patterns. However, the genetic code of any 
individual cell in an organism is identical (as always in biology, there are exceptions, but 
they are not relevant to this analysis), and cellular differentiation at distinct tissues is 
the result of differentiated (genetic) expressions of this sophisticated program. 

As studied in Chapter 5, complex systems usually have a modular hierarchical struc- 
ture or, in computer science jargon, an object oriented design. In allopoietic systems 
object orientation is achieved by explicit design, that is, it has to be introduced by a 
knowledgeable and disciplined programmer, see Budd (1999). In autopoietic systems 
modularity is an implicit and emergent property, as analyzed in Angeline (1996), Banzaff 
(1998), Iba (1992), Lauretto et al. (2009) and Chapter 5. 

Object oriented design entails the reuse, over and over, of the same modules (genes, 
functions or sub-routines) as control mechanisms for different processes. The ability 
to easily implement this kind of feature was actively pursued in computer science and 
software engineering. Object orientation was also discovered, with some surprise, to be 
naturally occurring in developmental biology, see Carrol (2005). 

However, like any abused feature, code reuse can also become a burden in some circum- 
stances. The difficulty of locating the source of a functionality (or a bug) in an intricate 
inheritance hierarchy, represented by a complex dependency graph, is known in computer 
science as the yoyo problem. According to the glossary in Budd (1999, p. 408), "Yoyo 
problem: Repeated movements up and down the class hierarchy that may be required 
when the execution of a particular method invocation is traced." 
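
As a hedged illustration of the yoyo problem quoted above, the following minimal Python sketch builds a three-level class hierarchy in which tracing a single call forces the reader to move up and down the hierarchy several times. The class names (Base, Middle, Leaf) and the helper trace_run are hypothetical, invented only for this example; they do not come from Budd (1999).

```python
class Base:
    def __init__(self):
        self.log = []  # records which class's method actually ran

    def run(self):
        self.log.append("Base.run")
        self.prepare()   # may resolve to an override further down the hierarchy
        self.finish()    # likewise resolved on the most derived class first

    def prepare(self):
        self.log.append("Base.prepare")

    def finish(self):
        self.log.append("Base.finish")


class Middle(Base):
    def prepare(self):
        self.log.append("Middle.prepare")
        super().prepare()  # up to Base
        self.helper()      # down again, to the most derived helper

    def helper(self):
        self.log.append("Middle.helper")


class Leaf(Middle):
    def run(self):
        self.log.append("Leaf.run")
        super().run()      # Middle defines no run, so this lands in Base

    def helper(self):
        self.log.append("Leaf.helper")

    def finish(self):
        self.log.append("Leaf.finish")
        super().finish()


def trace_run():
    """Run one method call on a Leaf and return the order of executed methods."""
    obj = Leaf()
    obj.run()
    return obj.log
```

Reading the log returned by trace_run() shows control bouncing Leaf, Base, Middle, Base, Leaf, Leaf, Base: locating where a given behaviour (or bug) originates requires following every one of these jumps.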

Systems undergoing many changes or modifications, under repeated adaptation or 
expansion, or under rapid evolution are especially vulnerable to yoyo effects. Unfortunately, 
the design of the human brain and its mental abilities are subject to all of the above conditions. 
In the next subsection we study some examples in this area, related to biological neural 
networks and language. These examples also include some mental dissociative phenomena 
that can be considered as manifestations of the yoyo problem. 

6.3.1 Doing, Listening and Answering 

In this section we study some human capabilities related to doing (acting), listening (linguistic 
understanding) and answering (dialogue). The capabilities we have chosen to study 
are related to the phylogenetic acquisition and the ontogenetic development of: 

- Mechanisms for precision manipulation, production of speech and empathic feeling; 

- Syntax for complex manipulation procedures, language articulation and behavioral sim- 
ulation; 

- Semantics for action, communication and dialogue; and the learning of 

- Technological know-how, social awareness and self-awareness. 

When considering an action in a modern democratic society, we usually deliberate 
what to do (unless there is already a tacit agreement). We then communicate with other 
agents involved to coordinate this action, so that we are finally able to do what has to 
be done. Evolution, it seems, took exactly the other way around. Phylogenetically, the 
path taken by our species follows a stepwise development of several mechanisms (that 
were neither independent nor strictly sequential), including: 

1. A mechanism for 3-dimensional vision and precision measurement, fine motor con- 
trol of hands and mouth, and visual-motor coordination for the complex procedures of 
precision manipulation. 

2. Mechanisms for imitating, learning and simulating the former procedures or actions. 

3. Mechanisms for simulating (possible) actions taken by other individuals, their 
consequences and motivations, that is, mechanisms for awareness and (behavioral) under- 
standing of other individuals. 

4. A mechanism for communicating (possible) actions, used for commanding, controlling 
and coordinating group actions. The use of such a mechanism implies a degree of 
awareness of others, that is, some ability to communicate, explain, listen and learn what 
you do, you - an agent like me. 

5. Mechanisms for dialoguing and deliberating, that is, for negotiating, goal selecting 
and non-trivial social planning. The use of such mechanisms implies some self-awareness 
or consciousness, that is, the conceptualization of an ego, an abstract I - an agent like 
you. 

In a living individual, all of these mechanisms must be well integrated. Consequently, 
it is natural that they work using coherent implicit grammars, reflecting compatible subjacent 
rules of composition for action, language and inter-individual interaction. Indeed, 
recent research in neuroscience confirms the coherence of these mechanisms. Moreover, 
this research shows that this coherence is based not just on compatible designs of separate 
systems, but on intricate schemes of use and reuse of the same structures, namely, the 
firmware code or circuits implemented as biological neural networks. 

Mirror neuron is a concept of neuroscience that highlights the reuse of the same circuits 
for distinct functions. A mirror neuron is part of a circuit which is activated (fires) when 
an individual executes an action, and also when the individual observes another individual 
executing the same action, as if he, the observer, were performing the action himself. The 
following passages, from important contemporary neuroscientists, give some hints on how 
the mechanisms mentioned in the previous paragraph are structured. 

The first group of quotes, from Hesslow (2002, p. 245), states the mirror neuron sim- 
ulation hypothesis, according to which, the same circuits used to control our actions are 
used to learn, simulate, and finally "understand" possible actions taken by other individ- 
uals. According to the simulation hypothesis, we are then naturally endowed with the 
capability of observing, listening, and "reading the mind" of (that is - understanding, 
by simulation, the meaning or intent of the possible actions taken by) our fellow human 
beings. 

...the simulation hypothesis states that thinking consists of simulated interac- 
tion with the environment and rests on the following three core assumptions: 

(1) simulation of actions: we can activate motor structures of the brain in a 
way that resembles activity during a normal action but does not cause any 
overt movement; 

(2) simulation of perception: imagining perceiving something is essentially the 
same as actually perceiving it, only the perceptual activity is generated by the 
brain itself rather than by external stimuli; 

(3) anticipation: there exist associative mechanisms that enable both behav- 
ioral and perceptual activity to elicit other perceptual activity in the sensory 
areas of the brain. Most importantly, a simulated action can elicit perceptual 
activity that resembles the activity that would have occurred if the action had 
actually been performed, (p. 5). 

In order to understand the mental state of another when observing the other 
acting, the individual imagines herself/himself performing the same action, a 
covert simulation that does not lead to an overt behavior, (p. 5). 

The second group of quotes, from Rizzolatti and Arbib (1998), states the mirror neuron 
linguistic hypothesis, according to which, the same structures used for action simulation, 
are reused to support human language. 

Our proposal is that the development of the human lateral speech circuit is 
a consequence of the fact that the precursor of Broca's area was endowed, 
before speech appearance, with a mechanism for recognizing actions made by 
others. This mechanism was the neural prerequisite for the development of 
inter-individual communication and finally of speech. We thus view language 
in a more general setting than one that sees speech as its complete basis. 
(p. 190). 

...a 'pre-linguistic grammar' can be assigned to the control and observation of 
actions. If this is so, the notion that evolution could yield a language system 
'atop' of the action system becomes much more plausible, (p. 191). 

In conclusion, the discovery of the mirror system suggests a strong link between 
speech and action representation. 'One sees a distinctly linguistic way of doing 
things down among the nuts and bolts of action and perception, for it is 
there, not in the remote recesses of cognitive machinery, that the specifically 
linguistic constituents make their first appearance', (p. 193-194). 

Finally, a third group of quotes, from Ramachandran (2007), states the mirror neuron 
self-awareness hypothesis, according to which the same structures used for action 
simulation are reused, over again, to support abstract concepts related to consciousness 
and self-awareness. According to this perspective, perhaps the most important of such 
concepts, that of an abstract self-identity or ego, is built upon one's already developed 
simulation capability for looking at oneself as if looking at another individual. 

I suggest that 'other awareness' may have evolved first and then counter- 
intuitively, as often happens in evolution, the same ability was exploited to 
model one's own mind - what one calls self awareness. 

How does all this lead to self awareness? I suggest that self awareness is simply 
using mirror neurons for 'looking at myself as if someone else is look at me' 
(the word 'me' encompassing some of my brain processes, as well). 

The mirror neuron mechanism - the same algorithm - that originally evolved 
to help you adopt another's point of view was turned inward to look at your 
own self. This, in essence, is the basis of things like 'introspection'. 

This in turn may have paved the way for more conceptual types of abstraction; 
such as metaphor ('get a grip on yourself'). 

Yoyo Effects and the Human Mind 

From our analyses in the preceding sections, one should expect, as a consequence of the 
heavy reuse of code under fast development and steady evolution, the sporadic occurrence 
of some mental yoyo problems. Such yoyo effects break the harmonious way in which the 
same code is (or circuits are) supposed to work as an integral part of several functions 
used to do, listen and answer, that is, to control action performance, language communication, 
and awareness of self or others. In psychology, many such effects are known 
as dissociative phenomena. For carefully controlled studies of low level dissociative phenomena 
related to corporal action-perception, see Schooler (2002) and Johansson et al. 
(2008). 

In the following paragraphs we give a glimpse of possible neuroscience perspectives on 
some high level dissociative phenomena. Simulation mechanisms are (re)used to simulate 
one's actions, as well as other agents' actions. Contextualized action simulation is the 
basis for intentional and motivational inference. From there, one can assess even higher 
abstraction levels such as tactical and strategic thinking, or even ethics and morality. 
But these capabilities must rely on some principle of decomposition, that is, the ability 
to separate, to some meaningful degree, one's own mental state from the mental state of 
those whose behavior is being simulated. This premise is clearly stated in Decety and 
Grezes (2005, p.5): 

One critical aspect of the simulation theory of mind is the idea that in trying to 
impute mental states to others, an attributor has to set aside her own current 
mental states and substitute those of the target. 

Unfortunately, as seen in the preceding section, the same low level circuits used to 
support simulation are also used to support language. This can lead to conflicting requests 
to use the same resources. For example, verbalization requires introspection, a process 
that conflicts with the need to set aside one's own current mental state. This conflict leads 
to verbal overshadowing - the phenomenon by which verbally describing or explaining an 
experienced or simulated situation somehow modifies or impairs its correct identification 
(like recognition or recollection), or distorts its understanding (like contextualization or 
meaning). Some causes and consequences of this kind of conflict are addressed by Iacoboni 
(2008, p.270): 

Mirror neurons are pre-motor neurons, remember, and thus are cells not really 
concerned with our reflective behavior. Indeed, mirroring behaviors such as 
the chameleon effect seem implicit, automatic, and pre-reflexive. Meanwhile, 
society is obviously built on explicit, deliberate, reflexive discourse. Implicit 
and explicit mental processes rarely interact; indeed, they can even dissociate. 
(p.270). 

Psychoanalysis can teach us a lot about high level dissociations such as emotional / 
rational psychological mismatches and individual / social behavioral misjudgments. For a 
constructivist perspective of psychotherapy see Efran et al. (1990), and further comments 
in section 7. 

We end this section by posing a tricky question capable of inducing the most spectacular 
yoyo bouncings. This provocative question is related to the role played by division 
algebras; Goertzel's articles mentioned in the introduction provide a good source of references. 
Division algebras capture the structure of eigen-solutions entailed by symmetry 
conditions for the recursively generated systems of specular images in a mirror house. The 
same division algebras are of fundamental importance in many physical theories, see Dion 
et al. (1995), Dixon (1994) and Lounesto (2001). Finally, division algebras capture the 
structure of 2-dimensional (complex numbers) and 3-dimensional (quaternion numbers) 
rotations and translations governing human manipulation of objects, see Hanson (2006). 
We can thus ask: Do we keep finding division algebras everywhere out there when trying 
to understand the physical universe because we already have the appropriate hardware 
to see them, or is it the other way around? We can only suspect that any trivial choice 
in the dilemma posed by this trick question will only result in an inappropriate answer. 
We shall revisit this theme in sections 7 and 8. 
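
To make the link between division algebras and rotations concrete, here is a minimal Python sketch, under the usual conventions, of a planar rotation by complex multiplication and a spatial rotation by quaternion conjugation (v' = q v q*). The function names rotate_2d, quat_mul and rotate_3d are illustrative choices for this example and are not taken from Hanson (2006).

```python
import cmath
import math


def rotate_2d(point, angle):
    """Rotate a point (x, y) about the origin by multiplying with a unit complex number."""
    z = complex(*point) * cmath.exp(1j * angle)
    return (z.real, z.imag)


def quat_mul(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def rotate_3d(vector, axis, angle):
    """Rotate a 3-vector about a unit axis by conjugation with a unit quaternion."""
    half = angle / 2.0
    s = math.sin(half)
    q = (math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s)
    q_conj = (q[0], -q[1], -q[2], -q[3])
    # Embed the vector as a pure quaternion (zero scalar part), then conjugate.
    w, x, y, z = quat_mul(quat_mul(q, (0.0, *vector)), q_conj)
    return (x, y, z)
```

For instance, a quarter turn maps (1, 0) to (0, 1) in the plane, and a quarter turn about the z-axis maps (1, 0, 0) to (0, 1, 0) in space, up to floating point rounding.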

6.3.2 Mnemes, Memes, Mimes, and all that. 

We can make the ladder of hierarchical complexity in the systems analyzed in the last 
sections go even further up, as if it did not climb high enough, by including new steps 
in the socio-cultural realms that stand above the level of simple or direct inter-individual 
interaction, such as art, law, religion, science, etc. The origami example of section 1 
is used by Richard Dawkins as a prototypical meme or a unit of imitation. The term 
mneme, derived from μνήμη, the muse of memory, was used by Richard Semon as a unit 
of retrievable memory. Yet another variant of this term, mime, is derived from μίμησις or 
imitation. All these terms have been used to suggest a basic model, a single concept, an 
elementary idea, a memory trace or unit, or to convey related meanings, see Blackmore 
and Dawkins (1999), Dawkins (1976), van Driem (2007), Schacter (2001), Schacter et al. 
(1978), and Semon (1904, 1909, 1921, 1923). 

Richard Semon's theory was able to capture many important characteristics concerning 
the storage or memorization, retrieval, propagation, reproduction and survival of mnemes. 
Semon was also able to foresee many important details and interconnections, at a time 
when there were no experimental techniques suitable for an empirical investigation of the 
relevant neural processes. Unfortunately, Semon's analysis also suffers from the yoyo effect 
in some aspects. That is not surprising at all given the complexity of the systems he was 
studying and the lack of suitable experimental tools. These yoyo problems were related to 
some mechanisms, postulated by Semon, for mnemetic propagation across generations, or 
mnemetic hereditarity. Such mechanisms had a Lamarckian character, since they implied 
the possibility of hereditary transmission of learned or acquired characteristics. 

In modern Computer Science, the term memetic algorithm is used to describe evolutive 
programming based on populational evolution by code (genetic) propagation that 
combines a Darwinian or selection phase, and a local optimization or Lamarckian learning 
phase, see Moscato (1989). Such algorithms were inspired by the evolution of ideas and 
culture in human societies, and they proved to be very efficient for solving some complex 
combinatorial problems, see Ong et al. (2007) and Smith (2007). Consequently, even 
though we now know, based on contemporary neuroscience, that some of the concepts developed 
by Semon are not appropriate to explain specific phenomena among those he was studying, 
he was definitely postulating, far ahead of his time, some very interesting and useful 
ideas. 
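
The two-phase structure of a memetic algorithm can be sketched in a few lines of Python. This is only an illustrative toy, not the algorithm of Moscato (1989) or of the other cited works: the problem (balancing a list of weights into two groups), the function names, and all parameter values are assumptions made for this example. The Darwinian phase is truncation selection with crossover and mutation; the Lamarckian phase is a bit-flip hill climb whose improved result is written back into the individual.

```python
import random


def fitness(bits, weights):
    """Negated imbalance between the two groups; 0 is a perfect partition."""
    left = sum(w for b, w in zip(bits, weights) if b)
    return -abs(2 * left - sum(weights))


def local_search(bits, weights):
    """Lamarckian phase: flip single bits while this strictly improves fitness."""
    best = list(bits)
    best_fit = fitness(best, weights)
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            best[i] ^= 1
            f = fitness(best, weights)
            if f > best_fit:
                best_fit, improved = f, True
            else:
                best[i] ^= 1  # undo a flip that did not help
    return best


def memetic(weights, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    n = len(weights)
    pop = [local_search([rng.randint(0, 1) for _ in range(n)], weights)
           for _ in range(pop_size)]
    for _ in range(generations):
        # Darwinian phase: keep the better half as parents (elitist selection).
        pop.sort(key=lambda b: fitness(b, weights), reverse=True)
        parents = pop[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)          # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n)] ^= 1       # single-bit mutation
            # Learned improvements are inherited by writing them into the genome.
            children.append(local_search(child, weights))
        pop = parents + children
    return max(pop, key=lambda b: fitness(b, weights))
```

A call such as memetic([8, 7, 6, 5, 4]) returns a list of 0/1 group labels; because every individual passes through local_search, the returned solution cannot be improved by any single flip.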

Nevertheless, to Semon's misfortune, he published his theory in the aftermath of the 
great Darwinian victory over the competing Lamarckian view in the field of biological 
evolution. At that time, any perceived contamination by Lamarckian ideas was a kiss 
of death for a new theory, even if postulated within a clearly distinct context. As a 
regrettable consequence, the mneme concept was rejected and cast into oblivion for half 
a century, until its revival as Dawkins' meme. Such a drama is by no means unusual in 
the history of science. It seems that some ideas, postulated ahead of their time, have to 
be incubated and remain dormant until the world is ready for them. Another example of 
this kind, related to the concept of statistical randomization, is analyzed in great detail 
in Chapter 3. 

6.4 Hypercyclic Bootstrapping 

On March 1st, 2009, the Wikipedia definition for bootstrapping read: 

Bootstrapping or booting refers to a group of metaphors that share a common 
meaning, a self-sustaining process that proceeds without external help. The 
term is often attributed to Rudolf Erich Raspe's story The Adventures of 
Baron Munchausen, where the main character pulls himself out of a swamp, 
though it's disputed whether it was done by his hair or by his bootstraps. 

The attributed origin of this metaphor, the (literally) incredible adventures of Baron 
Münchhausen, well known as a compulsive liar, makes us suspect that there may be 
something wrong with some of its uses. There are, however, many examples where boot- 
strapping explanations can be rightfully applied. Let us analyze a few examples: 

1. The Tostines mystery: Does Tostines sell more because it is always fresh and 
crunchy, or is it always fresh and crunchy because it sells more? 

This slogan was used in a very successful marketing campaign that launched the relatively 
unknown brand Tostines, from Nestlé, to a leading position in the Brazilian market 
of biscuits, crackers and cookies. The expression Tostines mystery became idiomatic in 
Brazilian Portuguese, playing a role similar to that of the expression bootstrapping in 
English. 

2. The C computer language and the UNIX operating system: Perhaps the most 
successful and influential computer language ever designed, C was conceived having boot- 
strapping in mind. The core language is powerful but spartan. Many capabilities that are 
an integral part of other programming languages are provided by functions in external 
standard libraries, including all device dependent operations such as input-output, string 
and file manipulation, mathematical computations, etc. C was part of a larger project 
to write UNIX as a portable operating system. In order to bring UNIX and all of its 
goodies to a new machine (device drivers should already be there), we only have to 
translate the assembly code for a core C compiler, compile a full C compiler, compile 
the entire UNIX system, compile all the application programs we want, and voila, we are 
done. Bootstrapping, as a technological approach, is of fundamental importance for the 
computer industry as it allows the development of ever more powerful software and the 
rapid substitution of hardware. 

3. The Virtuous cycle of open source software: An initial or starting code contri- 
bution is made available at an open source code repository. Developer communities can 
use the resources at the repository according to the established open source license. Developers 
create software or application programs according to their respective business 
models, affected by the open source license agreements and the repository governance 
policy. The use of existing software motivates new applications or extensions to the ex- 
isting ones, generating the development of new programs and new contributions to the 
open source repository. Code contributions to the repository are filtered by a controlling 
committee according to a governance model. The full development cycle works using the 
highlighted elements as catalysts, and is fuelled by the work of self-interested individuals 
acting according to their own motivations, see Heiss (2007). 

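4. The Bethe-Weizsäcker main catalytic cycle (CNO-I): 

$$
\begin{aligned}
{}^{12}_{6}\mathrm{C} + {}^{1}_{1}\mathrm{H} &\to {}^{13}_{7}\mathrm{N} + \gamma + 1.95\,\mathrm{MeV}; &
{}^{13}_{7}\mathrm{N} &\to {}^{13}_{6}\mathrm{C} + e^{+} + \nu + 2.22\,\mathrm{MeV}; \\
{}^{13}_{6}\mathrm{C} + {}^{1}_{1}\mathrm{H} &\to {}^{14}_{7}\mathrm{N} + \gamma + 7.54\,\mathrm{MeV}; &
{}^{14}_{7}\mathrm{N} + {}^{1}_{1}\mathrm{H} &\to {}^{15}_{8}\mathrm{O} + \gamma + 7.35\,\mathrm{MeV}; \\
{}^{15}_{8}\mathrm{O} &\to {}^{15}_{7}\mathrm{N} + e^{+} + \nu + 2.75\,\mathrm{MeV}; &
{}^{15}_{7}\mathrm{N} + {}^{1}_{1}\mathrm{H} &\to {}^{12}_{6}\mathrm{C} + {}^{4}_{2}\mathrm{He} + 4.96\,\mathrm{MeV}.
\end{aligned}
$$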

This example presents the nuclear synthesis of one atom of Helium from four atoms of 
Hydrogen. Carbon, Nitrogen and Oxygen act as catalysts in this cyclic reaction, that also 
produces gamma rays, positrons and neutrinos. Note that the Carbon-12 atom used in 
the first reaction is regenerated at the last one. The CNO nuclear fusion cycle is the main 
source of energy in stars with mass twice as large or more than that of the sun. We have 
included this example from nuclear physics in order to stress the fact that catalytic cycles 
play an important role in phenomena occurring in spatial and temporal scales which are 
much smaller than those typical of chemistry or biology, where some of the readers may 
find them more familiar. 
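
As a worked check of the catalytic bookkeeping, one can add the six reactions of the CNO-I cycle. Every carbon, nitrogen and oxygen nucleus appears once as a reactant and once as a product, so all of them cancel, leaving only the net conversion of four protons into one helium nucleus. The total energy below is simply the sum of the six values listed in the cycle, not an independently sourced figure:

$$
4\,{}^{1}_{1}\mathrm{H} \;\to\; {}^{4}_{2}\mathrm{He} + 2e^{+} + 2\nu + 3\gamma,
\qquad
1.95 + 2.22 + 7.54 + 7.35 + 2.75 + 4.96 = 26.77\,\mathrm{MeV}.
$$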

5. RNA and DNA replication: DNA and RNA duplication, translation, and copying 
may, in general, be considered the core cycle of life, since it is the central cycle of biological 
reproduction. Although even a simple description of this process is far too complex to be 
included in this book, it is worth noting that RNA and DNA copy mechanisms rely on many 
enzymes and auxiliary structures, which are only available because they themselves are 
synthesized or regenerated in the living cell by other, also very complex, cyclical networks. 

Examples 4 and 5 are taken from Eigen (1977). Examples 3 and 5 are, in Manfred 
Eigen's nomenclature, hypercycles. Eigen defines an autocatalytic cycle as a (chemical) 
reaction cycle that, using additional resources available in its environment, produces an 
excess of one or more of its own reactants. A hypercycle is an autocatalytic reaction of 
second or higher order, that is, an autocatalytic cycle connecting autocatalytic units. In a 
more general context, a hypercycle indicates self-reproduction of second or higher order, 
that is, a second or higher order cyclic production network including lower order self- 
replicative units. In the prototypical hypercycle architecture, a lower order self-replicative 
unit plays then a dual catalytic role: First, it has an auto-catalytic function in its own 
reproduction. Second, it acts like a catalyst promoting an intermediate step of the higher 
order cycle. 
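
The dual catalytic role described above can be explored numerically. The text gives no equations, so the following Python sketch assumes a commonly used replicator-type form of the elementary hypercycle, in which the growth of unit i is proportional to the product of its own concentration and that of its predecessor in the cycle, with a dilution flux that keeps the total concentration equal to one. The function names, rate constants, step size and initial state are all illustrative assumptions.

```python
def hypercycle_step(x, k, dt):
    """One explicit Euler step of dx_i/dt = x_i * (k_i * x_{i-1} - phi)."""
    n = len(x)
    # Unit i replicates itself (factor x[i]) with catalytic help from unit i-1;
    # index -1 wraps around, which closes the cycle.
    growth = [k[i] * x[i] * x[i - 1] for i in range(n)]
    phi = sum(growth)  # dilution flux that conserves the total concentration
    return [x[i] + dt * (growth[i] - x[i] * phi) for i in range(n)]


def simulate(x0, k, dt=0.01, steps=20000):
    """Integrate the hypercycle from relative concentrations x0 (summing to 1)."""
    x = list(x0)
    for _ in range(steps):
        x = hypercycle_step(x, k, dt)
    return x


final = simulate([0.6, 0.3, 0.1], [1.0, 1.0, 1.0])
```

Under these assumptions, with three units and equal rate constants, the run settles toward equal concentrations: no unit outcompetes the others, because each depends on its predecessor for its own replication.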

6.4.1 Bootstrapping Paradoxes 

Let us now examine some ways in which the bootstrapping metaphor is wrongfully ap- 
plied, that is, it is used to generate incongruent or inconsistent arguments, supposed to 
accommodate contradictory situations or to explain the existence of impossible processes. 
We will focus on four cases of historical interest and great epistemological importance. 

Perpetua Mobile 

Perhaps the best known paradox related to the bootstrapping metaphor is connected to a 
class of examples known as Perpetuum Mobile machines. These machines are supposed to 
operate forever without any external help or even to produce some useful energy output. 
Unfortunately, perpetual mobiles are only wishful thinking, since the existence of such 
a machine would violate the first, second and third laws of thermodynamics. These are 
essentially "no free lunch" principles, formulated as inequalities for the flow (balance 
or transfer) of matter, energy and information in a general system, see Atkins (1984), 
Dugdale (1996) and Tarasov (1988). 

Hypercyclical processes are not magical and must rely on energy, information (order 
or neg-entropy) and raw materials available at their environment. In fact, the use of 
external sources of energy and information is so important, that it entails the definition 
of metabolism used in Eigen (1977): 

Metabolism: (The process) can become effective only for intermediate states 
which are formed from energy-rich precursors and which are degraded to some 
energy-deficient waste. The ability of the system to utilize the free energy and 
the matter required for this purpose is called metabolism. The necessity of 
maintaining the system far enough from equilibrium by a steady compensation 
of entropy production has been first clearly recognized by Erwin Schrödinger 
(1945). 

The need for metabolism may come as a disappointment to professional wishful thinkers, 
engineers of perpetuum mobile machines, narcissistic philosophers and other anorexic de- 
signers. Nevertheless, it is important to realize that metabolic chains are in fact an integral 
part of the hypercycle concept. Hypercycles are built upon the possibility that the raw 
material that is supposed to be freely available in the environment for one autocatalytic 
reaction, may very well be the product of another catalytic cycle. Moreover, the same 
thermodynamic laws that prevent the existence of a perpetuum mobile are fully compatible 
with a truly wonderful property of hypercycles, namely, their almost miraculous 
efficiency, as stated in Eigen (1977): 

Under the stated conditions, the product of the plain catalytic process will 
grow linearly with time, while the autocatalytic system will show exponential 
growth. 
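
A minimal sketch of why the two growth laws differ, assuming an unlimited supply of precursors and constant rate parameters. The symbols x (amount of product), c (fixed amount of catalyst) and k (rate constant) are introduced here for illustration and are not Eigen's notation. In the plain catalytic process the production rate is constant, while in the autocatalytic process the product catalyzes its own formation, so the rate is proportional to the amount already present:

$$
\frac{dx}{dt} = k\,c \;\Rightarrow\; x(t) = x_0 + k\,c\,t ,
\qquad\qquad
\frac{dx}{dt} = k\,x \;\Rightarrow\; x(t) = x_0\, e^{k t} .
$$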

Evolutionary View 

The exponential or hyperbolic (super-exponential) growth of processes based on autocatalytic 
cycles and hypercycles has profound evolutionary implications. Populations 
growing exponentially in environments with limited resources, or even with resources 
growing at a linear or polynomial rate, find themselves in the Malthusian conundrum of 
ever increasing individual or group competition for ever scarcer resources. In this 
setting, selection rules applied to a population of individuals struggling to survive and 
reproduce inexorably lead to an evolutive process. This qualitative argument goes back 
to Thomas Robert Malthus, Alfred Russel Wallace, and Charles Darwin, see Ingraham 
(1982) and Richards (1989). 
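
The arithmetic behind the Malthusian conundrum can be stated in one line. Assume, purely for illustration, a population N(t) growing exponentially at rate r and a resource supply R(t) growing polynomially with degree d; these symbols are not taken from the cited sources. The resources available per individual then vanish in the long run, whatever the values of the constants:

$$
N(t) = N_0\, e^{r t}, \qquad R(t) = R_0 + a\, t^{d}
\quad\Longrightarrow\quad
\lim_{t \to \infty} \frac{R(t)}{N(t)} = 0 .
$$

Under the same kind of idealization, hyperbolic growth corresponds to a rate proportional to the square of the population, which diverges in finite time instead of merely outpacing every polynomial:

$$
\frac{dN}{dt} = k\,N^{2} \;\Rightarrow\; N(t) = \frac{N_0}{1 - k\,N_0\,t}, \qquad 0 \le t < \frac{1}{k\,N_0} .
$$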

Several alternative mathematical models for evolutive processes only confirm the 
soundness of the original Malthus-Wallace-Darwin argument. Eigen (1977, 1978a,b) analyses 
evolutionary processes on the basis of dynamical systems models using the language 
of ordinary differential equations. In Chapter 5 we take a completely 
different approach, analyzing evolutionary processes on the basis of stochastic 
optimization algorithms using the language of inhomogeneous Markov chains. For other 
possible approaches see Jantsch and Waddington (1976) and Jantsch (1980, 1981). It is 
remarkable however, that the qualitative conclusions of these distinct alternative analyses 
are in complete agreement. 

The evolutionary view replaces a static scenario by a dynamic context. This replacement 
has the side effect of enhancing or amplifying most of the mirror-house illusions 
studied in this chapter. No wonder then, that the adoption of an evolutionary view requires 
from the observer a solid background in well founded scientific theories together 
with a firm command of a logical and coherent epistemological framework in order to keep 
his or her balance and maintain straight judgment. 

Building Blocks and Modularity 

Another consequence of the analysis of evolutionary processes, using either the dynamical 
systems approach, see Eigen (1977, 1978a,b), or the stochastic optimization approach, see 
Chapter 5, is the spontaneous emergence of modular structures and hierarchical organi- 
zation of complex systems. 

A classic illustration of the need for modular organization is given by the Hora and 
Tempus parable of Simon (1996), see also Growney (1982). This is a parable about two 
watch makers, named Hora and Tempus, both of whom are respected manufacturers and, 
under ideal conditions, produce watches of similar quality and price. Each watch requires 
the assemblage of n = 1000 elementary pieces. However, while Hora uses a hierarchical 
modular design, Tempus does not. Hora builds each watch with 10 large blocks, each 
made of 10 small modules of 10 single parts each. Consequently, in order to make a 
watch, Hora needs to assemble m = 111 modules with r = 10 parts each, while Tempus 
needs to assemble only m = 1 module with r = 1000 parts. It takes either Hora or 
Tempus one minute to put a part in its proper place. Hence, while Tempus can assemble 
a watch in 1000 minutes, Hora can only do it in 1110 minutes. However, both work in a 
noisy environment, being subject to interruptions (like receiving a telephone call). While 
placing a part, an interruption occurs with probability p = 0.01. Partially assembled 
modules are unstable, breaking down at an interruption. Under these conditions, the 
expected time to assemble a watch is 

$$ \frac{m}{p} \left( \frac{1}{(1-p)^{r}} - 1 \right) . $$

Substituting the values in the parable for p, m and r, one finds that Hora's manufacturing 
process is a few thousand times more efficient than Tempus'. After this analysis, it is not 
difficult to understand why Tempus struggles while Hora prospers. 
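
The comparison can be checked numerically. The sketch below assumes the model stated in the parable: each placement attempt costs one minute, fails with probability p, and a failure sends the current module back to zero parts. Under that assumption the expected time E_k to reach k consecutive successful placements satisfies E_k = (E_{k-1} + 1) / (1 - p), which sums to ((1 - p)^(-r) - 1) / p for a module of r parts. The function names are illustrative.

```python
def expected_module_time(r, p):
    """Closed form: expected minutes to complete one module of r parts."""
    return ((1.0 - p) ** (-r) - 1.0) / p


def expected_module_time_recursive(r, p):
    """Same quantity from the recursion E_k = (E_{k-1} + 1) / (1 - p), E_0 = 0."""
    e = 0.0
    for _ in range(r):
        e = (e + 1.0) / (1.0 - p)
    return e


def expected_watch_time(m, r, p):
    """Expected minutes to build a watch out of m modules of r parts each."""
    return m * expected_module_time(r, p)


hora = expected_watch_time(m=111, r=10, p=0.01)
tempus = expected_watch_time(m=1, r=1000, p=0.01)
ratio = tempus / hora
```

With the values of the parable, hora stays close to the interruption-free 1110 minutes, whereas tempus grows to millions of minutes, so ratio comes out in the low thousands. The recursive version is included only as an independent check of the closed form.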

Closing yet another cycle, we thus come to the conclusion that the evolution of complex 
structures requires modular design. The need for modular organization is captured by 
the following dicta of Herbert Simon: 

"Hierarchy, I shall argue, is one of the central structural schemes that the 
architect of complexity uses." Simon (1996, p. 184). 
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"The time required for the evolution of a complex form from simple ele- 
ments depends critically on the number and distribution of potential interme- 
diate stable subassemblies." Simon (1996, p. 190). 

"The claim is that the potential for rapid evolution exists in any complex 
system that consists of a set of subsystems, each operating nearly independently 
of the detailed process going on within the other subsystems, hence influenced 
mainly by the net inputs and outputs of the other subsystems. If the near-
decomposability condition is met, the efficiency of one component (hence its
contribution to organism fitness) does not depend on the detailed structure of 
other components." Simon (1996, p. 193). 

Standards and Once-Forever Choices

An important consequence of emerging modularity in evolutive processes is the recurrent
commitment to once-forever choices and the spontaneous establishment of standards. This
organizational side effect is responsible for mirror-house effects related to many misleading
questions leading to philosophical dead-ends. Why do (almost all) nations use the French
meter, m, as the standard unit of length, instead of the older Portuguese vara (≈ 1.1 m)
or the British yard (≈ 0.9 m)? Why did the automotive industry select 87 octane as
"regular" gasoline and settle on 12 V as the standard voltage for vehicles? Why do we
have chiral symmetry breaks, that is, why do we find only one specific type among two
or more possible isomeric molecular forms in organic life? What is so special about the
DNA - RNA genetic code that it is shared by all living organisms on planet Earth?

In this mirror house we must accept that the deepest truth is often pretty shallow.
Refusing to do so, insisting on extraction by forceps of more elaborate explanations, 
can take us seriously astray into foggy illusions, far away from clear reason and real 
understanding. Eigen (1977, p. 541-542) makes the following comments: 

The Paradigm of Unity and Diversity in Evolution: Why do millions of 
species, plants and animals, exist, while there is only one basic molecular 
machinery of the cell, one universal genetic code and unique chiralities of the 
macromolecules? 

This code became finally established, not because it was the only alterna- 
tive, but rather due to a peculiar 'once-forever' selection mechanism, which 
could start from any random assignment. Once-forever selection is a conse- 
quence of hypercyclic organization. 

6.5 Squaring the Cycle 

Ouroboros is a Greek name, οὐροβόρος ὄφις, meaning the tail-devouring snake, see Eleazar
(1760) and Franz (1981). It is also an ancient alchemical symbol of self-reflexive or cyclic
processes, of something perpetually re-creating itself. In modern cybernetics it is used
as a representation of autopoiesis. The ouroboros is represented as a single, integral
organism, the snake, whose head bites its own tail. This pictorial representation would 
not make much sense if the snake were cut into several pieces, yet, that is what may 
happen, if we are not careful, when trying to explain a cyclic process. 

Let us illustrate this discussion with a schematic representation of the fiscal cycle of 
an idealized republic. This cycle is represented by a diagram similar to the one presented 
in section 7. This square diagram has four arrows pointing, respectively, 

- Down: Citizens pay taxes to fulfill their duties; 

- Left: Citizens elect a senate or a house of representatives; 

- Up: The senate legislates fiscal policies; and 

- Right: A revenue service enforces fiscal legislation. 

Focusing on each one of the arrows we can speak, respectively, of 

- Downward causation, whereby individuals comply with established social constraints; 

- Upward causation, whereby the system's constraints are established and renewed;

- Leftward causation, whereby individuals (re)present new demands to the republic; 

- Rightward causation, whereby the status quo is maintained, stabilized and enforced. 

Each one of these causal relations is indeed helpful to understand the dynamic of our 
idealized republic. On the other hand, the omission of any single one of these relations 
breaks the cycle, and such an incomplete version of the schematic diagram would no longer 
explain a dynamical system. 

The adjectives up and down capture our feelings as an individual living under social
constraints (like customs, moral rules, laws and regulations) that may (seem to) be
overwhelming, while the adjectives left and right are late echoes of the seating arrangement
in the French legislative assembly of 1791, with the conservatives, protecting aristocratic
privileges of the ancien régime, sitting on the right and the liberals, voicing the laissez-
faire-laissez-passer slogans for free market capitalism, sitting on the left. How to assign
intuitive and meaningful positional or directional adjectives to links in a complex network 
is in general not so obvious. In fact, insisting on similar labeling practices is a common 
source of unnecessary confusion and misunderstanding. A practice that easily generates 
inappropriate interpretations is polysemy, the reuse of the same tags in different contexts. 
This is due to semantic contamination or spill over, that is, unwanted or unforeseen 
transfers of meaning, induced by polysemic overloading. 

We can ask several questions concerning the relative importance of specific links in 
causal networks. For example: Can we or should we by any means establish precedences
between the links in our diagram? Do upward causes precede or have higher status than
downward causes, or vice versa? Do rightward causes explain or have preponderance over
leftward causes, or vice versa? Do any of the possible answers imply a progressive or
revolutionary view? Do the opposite answers imply a conservative or reactionary view?
The same questions can be asked with respect to a similar diagram for scientific production 
presented in section 7. Do any of the possible answers imply an empiricist or Aristotelian
view? Do the opposite answers imply an idealistic or Platonic view? 

To some degree these can be legitimate questions and consequently, to the same degree, 
motivate appropriate answers. Nevertheless, following the main goal of this chapter, 
namely, the exploration of mirror-house illusions, we want to stress that extreme forms of 
these questions often lead to ill posed problems. Therefore, extreme answers to the same
questions often give an oversimplified, one-sided, biased, or distorted view of reality. The
dangerous consequences of acceding to the temptation of having an appetizing ouroboros
slice for supper are depicted, in the field of psychology, by the following quotations from
Efran (1990, p. 99, 47):

Using language, any cycle can be broken into causes and purposes... Note 
that inventing purposes - and they are invented - is usually an exercise in 
creating tautologies. A description is turned into a purpose that is then asked 
to account for the description. [A typical example] starts with the defining 
characteristic of life, self-perpetuation, and states that it is the purpose for 
which the characteristic exists. Such circular renamings are not illegal, but 
they do not advance the cause (no pun intended). (p. 99)

For a living system there is a unity between product and process: In other 
words, the major line of work for a living system is creating more of itself. 

Autopoiesis is neither a promise nor a purpose - it is an organizational
characteristic. This means that life lasts as long as it lasts. It doesn't come 
with guarantees. In contrast to what we are tempted to believe, people do 
not stay alive because of their strong survival instincts or because they have 
an important job to complete. They stay alive because their autopoietic or- 
ganization happens to permit it. When the essentials of that organization are 
lost, a person's career comes to an end - he or she disintegrates. (p. 47)

6.6 Emergence and Asymptotics 

Asymptotic entities emerge in a model as a law of large numbers, that is, as a stable
behavior of a quantity in the limiting case of model parameters corresponding to a system
with very many (asymptotically infinite) components. The familiar mathematical
notation used in these cases takes the form $\lim_{n \to \infty} g(n)$ or $\lim_{\epsilon \to 0} f(\epsilon)$. Typically, the
underlying model describes a local interaction in a small or microscopic scale, while the
resulting limit corresponds to a global behavior in a large or macroscopic scale.

The paradigmatic examples in this class express the behavior of thermodynamic variables
describing a system, such as volume, pressure and temperature of a gas, as asymptotic
limits in statistical mechanics models for (infinitely) many interacting particles, like
atoms or molecules, see Atkins (1984), Tarasov (1988). Other well known examples explain
the behavior of macro-econometric relations among descriptive variables of efficient
markets, like aggregated supply, demand, price and production, from micro-economic
models for the interaction of individual agents, see Ingrao and Israel (1990). Even organic
tissue movements in morphogenesis can be understood as the asymptotic limit of
local cellular interactions at microscopic scale, as already mentioned in section 2. In this
section we have chosen to examine some aspects of the collective behavior of flocks, schools
and swarms, that can be easily visualized in a space and time scale directly accessible to
our senses.

Large flocks of birds or schools of fish exhibit coordinated flight or swimming patterns
and manifest collective reaction movements that give the impression that the collective
entity has "a mind of its own". There are many explanations for why these animals swarm
together. For example, they may do so in order to achieve:

- Better aerodynamic or hydrodynamic performance by flying or swimming in tight formation;

- More efficient detection of needed resources or dangerous threats by the pooling of many
sensors;

- Increased reproductive and evolutive success by social selection rules; etc.

In this section, however, we will focus on another advantage:

- Reducing the risk of predation by evasive maneuvers.

The first point in the analysis of this example is to explain why it is a valid example 
of emergence, that is, to describe a possible local interaction model from which the global 
behavior emerges when the flock has a large number of individuals. We use the model 
programmed by Craig Reynolds (1987). 



In 1986 I made a computer model of coordinated animal motion such as 
bird flocks and fish schools. It was based on three dimensional computational 
geometry of the sort normally used in computer animation or computer aided 
design. I called the generic simulated flocking creatures boids. The basic 
flocking model consists of three simple steering behaviors which describe how 
an individual boid maneuvers based on the positions and velocities of its nearby
flockmates:

Separation: steer to avoid crowding local flockmates
Alignment: steer towards the average heading of local flockmates
Cohesion: steer to move toward the average position of local flockmates

Each boid has direct access to the whole scene's geometric description,
but flocking requires that it reacts only to flockmates within a certain small
neighborhood around itself.
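
The three steering rules quoted above can be sketched in a few lines of Python. This is a minimal two-dimensional illustration, not Reynolds' program: the weights, the two neighborhood radii, the speed cap and the function name step are all assumptions made here, and alignment is implemented on velocities rather than on headings.

```python
import math
import random


def step(boids, radius=5.0, sep_radius=1.0,
         w_sep=0.05, w_ali=0.05, w_coh=0.01, max_speed=2.0):
    """Advance every boid (x, y, vx, vy) by one unit time step."""
    updated = []
    for i, (x, y, vx, vy) in enumerate(boids):
        # A boid reacts only to flockmates inside its small neighborhood.
        mates = [b for j, b in enumerate(boids)
                 if j != i and math.hypot(b[0] - x, b[1] - y) < radius]
        ax = ay = 0.0
        if mates:
            n = len(mates)
            # Separation: steer away from flockmates that are too close.
            for bx, by, _, _ in mates:
                if math.hypot(bx - x, by - y) < sep_radius:
                    ax += w_sep * (x - bx)
                    ay += w_sep * (y - by)
            # Alignment: steer toward the average velocity of flockmates.
            ax += w_ali * (sum(b[2] for b in mates) / n - vx)
            ay += w_ali * (sum(b[3] for b in mates) / n - vy)
            # Cohesion: steer toward the average position of flockmates.
            ax += w_coh * (sum(b[0] for b in mates) / n - x)
            ay += w_coh * (sum(b[1] for b in mates) / n - y)
        vx, vy = vx + ax, vy + ay
        # Speed cap: an extra assumption, not one of the three rules.
        speed = math.hypot(vx, vy)
        if speed > max_speed:
            vx, vy = vx * max_speed / speed, vy * max_speed / speed
        updated.append((x + vx, y + vy, vx, vy))
    return updated


random.seed(0)
flock = [(random.uniform(0, 10), random.uniform(0, 10),
          random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(30)]
for _ in range(100):
    flock = step(flock)
```

All boids are updated synchronously from the previous state, and no boid uses any information beyond the positions and velocities of its neighbors; whatever coordinated motion appears in the flock is therefore not written into any single rule.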

The second point is to explain why being part of a flock can reduce the risk of predation:
Many predators, like a falcon hunting a sparrow, need to single out and focus on a chosen 
individual in order to strike accurately. However, the rapid change of relative positions of 
individuals in the flock makes it difficult to isolate a single individual as the designated 
target and follow it inside the moving flock. Computer simulation models show that this 
confusion effect greatly reduces the killing (success) rate in this kind of hunt. 

The third point in our analysis is to contrast the hunting of single individuals, as 
analyzed in the previous paragraph, with other forms of predation based on the capture 
of the entire flock, or a large chunk of it. The focus of such alternative hunting techniques 
is, in the relative topology of the flock, not on local but on global variables describing the 
collective entity. For example, as explained in Diachok (2006) and Leighton et al. (2004, 
2007), humpback whales collaborate using sophisticated strategies for hunting herring, 
including specific tactics for:

Detection: Whales use active sonar detection techniques, using specific frequencies
that resonate with and are attenuated by the swim bladders of the herring. In this
way, the whales can detect schools over long distances, and also measure their pertinent
characteristics.

Steering: Some whales broadcast loud sounds below the herring school, driving them
to the surface. Other whales blow a bubble-net around the school, spiraling in as the
school rises. The herring are afraid of the loud sounds at the bottom, and also afraid of
swimming through the bubble-net, and are thus forced into a dense pack at a compact
killing zone near the surface.

Capture: Finally, the whales take turns at the killing zone, rising to the surface with
their mouths wide open, catching hundreds of fish at a time or, so to speak, "biting off"
large chunks of the school.

Finally, let us propose two short statements that can be distilled from our examples. 
They are going to carry us to the next section. 

- Flocking makes it difficult for a predator to use local tactics tracking the trajectory
of a single individual; consequently, for a hunter that focuses on local variables it is hard to
know what exactly is going on.

- On the other hand, the same collective behavior creates the opportunity for global
strategies that track and manipulate the entire flock. These hunting techniques may be
very efficient, in which case we can say that the hunters know very well what they are
doing.

6.7 Constructive Ontologies



From the several examples mentioned in sections 2, 4 and 6, we can suspect that the 
emergence of properties, behaviors, organizational forms and other entities is the rule
rather than the exception for many non-trivial systems. Hence it is natural to ask about 
the ontological status of such entities. Ontology is a term used in philosophy referring 
to a systematic account of existence or reality. In this section we analyze the ontological 
status of emergent entities according to the Cog-Con epistemological framework. The 
following paragraphs give a brief summary of this perspective, as well as some specific 
epistemological terms as they are used in the Cog-Con framework. 

The interpretation of scientific knowledge as an eigen-solution of a research process is
part of a Cog-Con approach to epistemology. Figure 1 presents an idealized structure and
dynamics of knowledge production. This diagram represents, on the Experiment side (left 
column) the laboratory or field operations of an empirical science, where experiments are 
designed and built, observable effects are generated and measured, and the experimental 
data bank is assembled. On the Theory side (right column), the diagram represents the 
theoretical work of statistical analysis, interpretation and (hopefully) understanding ac- 
cording to accepted patterns. If necessary, new hypotheses (including whole new theories) 
are formulated, motivating the design of new experiments. Theory and experiment con- 
stitute a double feed-back cycle making it clear that the design of experiments is guided 
by the existing theory and its interpretation, which, in turn, must be constantly checked, 
adapted or modified in order to cope with the observed experiments. The whole system 
constitutes an autopoietic unit. 

The Cog-Con framework also includes the following definition of reality and some 
related terms:

1. Known (knowable) Object: An actual (potential) eigen-solution of a given
system's interaction with its environment. In the sequel, we may use a somewhat
more friendly terminology by simply using the term Object.

2. Objective (how, less, more): Degree of conformance of an object to the
essential attributes of an eigen-solution (to be precise, stable, separable and
composable).

3. Reality: A (maximal) set of objects, as recognized by a given system, when 
interacting with single objects or with compositions of objects in that set. 

The Cog-Con framework assumes that an object is always observed by an observer, just 
like a living organism or a more abstract system, interacting with its environment. There- 
fore, this framework asserts that the manifestation of the corresponding eigen-solution and 
the properties of the object are respectively driven and specified by both the system and 
its environment. More concisely, Cog-Con sustains: 

4. Idealism: The belief that a system's knowledge of an object is always
dependent on the system's autopoietic relations.

5. Realism: The belief that a system's knowledge of an object is always
dependent on the environment's constraints.

Consequently, the Cog-Con perspective requires a fine equilibrium, called Realistic or
Objective Idealism. Solipsism or Skepticism are symptoms of an epistemological analysis
that loses the proper balance by putting too much weight on the idealistic side. Conversely,
Dogmatic Realism is a symptom of an epistemological analysis that loses the
proper balance by putting too much weight on the realistic side. Dogmatic realism has
been, from the Cog-Con perspective, a very common (but mistaken) position in modern 
epistemology. Therefore, it is useful to have a specific expression, namely, something in 
itself to be used as a marker or label for such ill posed dogmatic statements. The method 
used to access something in itself is often described as: - Something that an observer 
would observe if the (same) observer did not exist, or - Something that an observer could 
observe if he made no observations, or - Something that an observer should observe in the 
environment without interacting with it (or disturbing it in any way), and many other 
equally senseless variations. 

From the preceding considerations, it should become clear that, from the Cog-Con 
perspective, the ontological status of emergent entities can be perfectly fine, as long 
as these objects correspond to precise, stable, separable and composable eigen-solutions. 
However, there is a long list of historical objections and complaints concerning such entities.
The following quotations from Pihlstrom and El-Hani (2002) elaborate on this point. 

Emergent properties are not metaphysically real independently of our practices
of inquiry but gain their ontological status from the practice-laden ontological
commitments we make.

[concerning] the issue of the ontological epistemological status of emergents
... we simply need to be careful in our recognition of emergent phenomena 
and continually ask the question of whether the pattern we see is more in our 
eye than the pattern we are claiming to see. 

Related to the supposed provisionality of emergents is the issue of their 
ontological status: Are emergent phenomena part of the real, authentic "fur- 
niture of the world", or are they merely a function of our epistemological, 
cognitive apparatus with its ever-ready mechanism of projecting patterns on 
to the world? 

From the summary of the Cog-Con epistemological framework presented above we 
conclude that, from this perspective, we have to agree with the first observations, and 
consider the last question as an ill posed problem. 

Another set of historical issues concerning the ontological status of emergents relates 
to our ways of understanding them. For some authors, "real" emergent entities must be 
genuinely "new", in the sense of being unanalyzable or unexplainable. For such authors, 
understanding is a mortal sin that threatens the very existence of an entity, that is,
understanding undermines its ontological status. Hence, according to these authors,
the most real of entities should always be somewhat mysterious. Vieira and El-Hani
(2009, p. 105) analyze this position:

A systemic property P of a system S will be irreducible if it does not follow, 
even in principle, from the behavior of the system's parts that S has property 
P. 

If a phenomenon is emergent by reasons of being unanalyzable, it will 
be an unexplainable, brute fact, or, to use Alexander's (1920/1979) words, 
something to be accepted with natural piety. We will not be able to predict 
or explain it, even if we know its basal conditions. 

In our view, if the understanding of the irreducibility of emergent properties 
is limited to this rather strong sense, we may lose from sight the usefulness 
of the concept... Indeed, claims about emergence turn out to be so strong, 
if interpreted exclusively in accordance with this mode of irreducibility, that 
they are likely to be false, at least in the domain of natural science (which are
our primary interest in this chapter). 

We fully agree with Vieira and El-Hani in rejecting unanalyzability or unexplainabil- 
ity as conditions for the "real existence" of emergent entities. As expected, the Cog-Con 
framework does not punish understanding, far from it. In Chapter 4, the Cog-Con perspective
for the meaning of objects in a specific reality is given by their interrelation
in a network of causal nexus, explaining why the corresponding eigen-solutions manifest
themselves the way they do. Such explanations include, especially in modern science, the
symbolic derivation of scientific hypotheses from general scientific laws, the formulation 
of new laws in an existing theory, and even the conception of new theories, as well as 
their general understanding based on accepted metaphysical principles. In the Cog-Con 
perspective, the understanding of an entity can only strengthen its ontological status, 
embedding it even deeper in the system's life, endowing it with even wider connections in 
the web of concepts, revealing more of its links with the great chain of being! 



6.8 Distinctions and Probability 

In the last two sections we have analyzed emergent objects and their properties. In
many of the examples used in our discussions, probability mechanisms were at the core
of the emergence process. In this section, other ways in which probability mechanisms
can generate complex or non-trivial structures will be presented. This section is also
dedicated to the study of the ontological status of probability, and the role played by 
explanations given by probabilistic mechanisms and stochastic causal relations. We begin 
our discussion examining the concept of mixed strategies in game theory, due to von 
Neumann and Morgenstern. 

Let us consider the matching pennies game, played by Odd and Even. Each of the
players has to show, simultaneously, a bit (0 or 1). If both bits agree (i.e., 00 or 11), Odd
wins. If the bits disagree (i.e., 01 or 10), Even wins. Each player has only two pure
or deterministic strategies available from which to choose: $s_0$ - show a 0, or $s_1$ - show a 1.

A solution, equilibrium or saddlepoint of a game is a set of strategies that leaves each 
player at a local optimum, that is, a point at which each player, having full knowledge of all 
the other players' strategies at that equilibrium point, has nothing to gain by unilaterally 
changing his own strategy. It is easy to see that, considering only the two deterministic 
strategies, the game of matching pennies has no equilibrium point. If Odd knows the 
strategy chosen by Even, he can just take the same strategy and win the game. In the 
same way, Even can take the opposite choice of Odd's, and win the game.

Let us now expand the set of strategies available to each player considering mixed or
randomized strategies, where each player picks among the pure strategies according to a
set of probabilities he specifies. We assume that a proper randomization device, like a
die, a roulette or a computer with a random number generator program, is available. In
the example at hand, Even and Odd can each specify a probability, respectively, $p_e$ and
$p_o$, for showing a 1, and $q_e = 1 - p_e$ and $q_o = 1 - p_o$, for showing a 0. It is easy to check
that $p_e = p_o = 1/2$ is a solution to this game.
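
The check mentioned above can be carried out numerically. The Python sketch below assumes a payoff of +1 to the winner and -1 to the loser (the text only says who wins), and tests for profitable unilateral deviations on a grid of probabilities; the function names and the grid size are illustrative choices.

```python
from itertools import product


def payoff_odd(p_o: float, p_e: float) -> float:
    """Expected payoff to Odd when Odd and Even show a 1 with
    probabilities p_o and p_e; Odd wins (+1) when the bits agree."""
    # P(agree) = p_o*p_e + (1-p_o)*(1-p_e), and payoff = 2*P(agree) - 1.
    return (2 * p_o - 1) * (2 * p_e - 1)


def is_equilibrium(p_o: float, p_e: float, grid: int = 101,
                   tol: float = 1e-12) -> bool:
    """True if neither player gains by unilaterally changing probability."""
    deviations = [k / (grid - 1) for k in range(grid)]
    u = payoff_odd(p_o, p_e)
    odd_can_gain = any(payoff_odd(d, p_e) > u + tol for d in deviations)
    # The game is zero-sum: Even's payoff is minus Odd's payoff.
    even_can_gain = any(-payoff_odd(p_o, d) > -u + tol for d in deviations)
    return not (odd_can_gain or even_can_gain)


pure_profiles = list(product([0.0, 1.0], repeat=2))
for p_o, p_e in pure_profiles:
    print(f"pure profile  p_o={p_o}, p_e={p_e}: {is_equilibrium(p_o, p_e)}")
print(f"mixed profile p_o=0.5, p_e=0.5: {is_equilibrium(0.5, 0.5)}")
```

Because each player's expected payoff is linear in his own probability, checking the two pure deviations would already suffice; the grid is kept only to make the test of unilateral deviations explicit.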

Oskar Morgenstern (2008, p. 270) makes the following comments about the philosophical
significance of mixed strategies:

It is necessary to examine the significance of the use of mixed strategies
since they involve probabilities in situations in which 'rational' behavior is
looked for. It seems difficult, at first, to accept the idea that 'rationality' - 
which appears to demand a clear, definite plan, a deterministic resolution - 
should be achieved by the use of probabilistic devices. Yet precisely such is 
the case. 

In games of chance the task is to determine and then to evaluate proba- 
bilities inherent in the game; in games of strategy we introduce probability in 
order to obtain the optimal choice of strategy. This is philosophically of some 
interest. 

The role played by mixed strategies can be explained, at least in part, by convex
geometry. A convex combination of two points, $p_0$ and $p_1$, is a point lying on the line
segment joining them, that is, a point of the form $p(\lambda) = (1 - \lambda) p_0 + \lambda p_1$, $0 \leq \lambda \leq 1$. A
convex set is a set that contains all convex combinations of its points. The extreme points
of a convex set are those that can not be expressed as (non-trivial) convex combinations
of other points in the set. A function $f(x)$ is convex if its epigraph, $\mathrm{epi}(f)$ - the set of
all points above the graph of $f(x)$ - is convex. A convex optimization problem consists of
minimizing a convex function over a convex region. The properties of convex geometry
warrant that a convex optimization problem has an optimal solution, i.e. a minimum,
$f(x^*)$. Moreover, the minimum argument, $x^*$, is easy to compute using a procedure such
as the steepest descent algorithm, that can be informally stated as follows: Place a particle
at some point over the graph of $f(x)$, and let it "roll down the hill" to the bottom of the
valley, until it finds its lowest point at $x^*$, see Luenberger (1984) and Minoux (1986).
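
The "rolling down the hill" picture can be made concrete with a short Python sketch. It is a bare-bones variant with a fixed step size (practical implementations usually choose the step by a line search), applied to a convex quadratic picked here purely for illustration; the function names, the step size and the stopping tolerance are assumptions.

```python
def steepest_descent(grad, x0, step=0.1, tol=1e-10, max_iter=10_000):
    """Follow the negative gradient from x0 until the gradient is
    (numerically) zero or max_iter iterations have been used."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break  # bottom of the valley reached
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def grad_f(v):
    """Gradient of the convex function f(x, y) = (x - 1)^2 + 2*(y + 3)^2."""
    x, y = v
    return [2.0 * (x - 1.0), 4.0 * (y + 3.0)]


x_star = steepest_descent(grad_f, [10.0, 10.0])
print(x_star)  # close to the minimizer (1, -3)
```

For this quadratic the fixed step of 0.1 is small enough for every iteration to move closer to the minimizer; for other functions a step that is too large can overshoot, which is the reason for the usual line search.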

In the matching pennies game, let us consider a convex combination of the two pure
strategies, that is, a strategy of the form $s(\lambda) = (1 - \lambda) s_0 + \lambda s_1$, $0 \leq \lambda \leq 1$. Since the pure
strategies form a discrete set, such a continuous combination of pure strategies is not even
well defined, except for the trivial extreme cases, $\lambda = 0$ or $\lambda = 1$. The introduction of randomization
gives a coherent definition for convex combinations of existing strategies and,
in so doing, it expands the set of available (mixed) strategies to a convex set where pure
strategies become extreme points. In this setting, a game equilibrium point can be characterized
as the solution of a convex optimization problem. Therefore, such an equilibrium
point exists and is easy to compute. This is one way of having a geometric understanding
of the von Neumann and Morgenstern theorems, as well as of subsequent extensions in game
theory due to John F. Nash, see Bonassi et al. (2009), Mesterton-Gibbons (1992) and
Thomas (1986).
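
For the matching pennies game, this characterization can be illustrated as follows. Assuming again a payoff of +1 for a win and -1 for a loss, Odd's worst-case expected loss, as a function of his probability of showing a 1, is a convex function on the interval [0, 1], and its minimizer is his equilibrium strategy. The sketch below minimizes it by ternary search, a simple method for convex functions of one variable that is used here as a stand-in for a general convex optimization routine; all names are illustrative.

```python
def worst_case_loss(p_o: float) -> float:
    """Odd's largest expected loss over Even's replies, when Odd shows
    a 1 with probability p_o (win = +1, loss = -1)."""
    # Odd's payoff is linear in Even's probability, so Even's best reply
    # can always be found among the two pure strategies.
    payoffs = [(2 * p_o - 1) * (2 * p_e - 1) for p_e in (0.0, 1.0)]
    return -min(payoffs)


def ternary_search(f, lo=0.0, hi=1.0, iters=200):
    """Minimize a convex function f on the interval [lo, hi]."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if f(a) < f(b):
            hi = b  # a minimizer lies in [lo, b]
        else:
            lo = a  # a minimizer lies in [a, hi]
    return (lo + hi) / 2.0


p_star = ternary_search(worst_case_loss)
print(p_star, worst_case_loss(p_star))
```

Here the worst-case loss equals |2 p_o - 1|, so the search settles at p_o = 1/2, where Odd's guaranteed expected payoff is zero; by symmetry the same holds for Even.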

The matching pennies example poses a δίλημμα, dilemma - a problem offering two
possibilities, neither of which is acceptable. The conceptual dichotomy created by constraining
the players to only two deterministic strategies creates an ambush. Caught in this
ambush, both players would be trapped, forever changing their minds between extreme
options. Randomization expands the universe of available possibilities and, in so doing,
allows the players to escape the perpetual flip-flopping at this discrete logic decision trap.
In section 8.2, we extrapolate this example and generalize these conclusions. However, before
proceeding in this direction, we shall analyze, in the next section, some objections to
the concepts of probability, statistics and randomization posed by George Spencer-Brown,
a philosopher of great influence in the field of radical constructivism.

6.8.1 Spencer-Brown, Probability and Statistics 

Spencer-Brown (1953, 1957) analyzed some apparent paradoxes involving the concept of
randomness, and concluded that the language of probability and statistics is inappropriate
for the practice of scientific inference. In subsequent work, Spencer-Brown (1969)
reformulates classical logic using only a generalized nor operator (marked not-and unmarked
or), that he represents à la mode of Charles Sanders Peirce or John Venn,
using a graphical boundary or distinction mark, see Edwards (2004), Kauffmann (2001,
2003), Meguire (2003), Peirce (1880), Sheffer (1913). Making distinctions is, according to
Spencer-Brown, the basic (if not the only) operation of human knowledge, an idea that has 
either influenced or been directly explored by several authors in the radical constructivist 
movement. Some typical arguments used by Spencer-Brown in his rejection of probability 
and statistics are given in the next quotations from Spencer-Brown (1957, p. 66, 105, 113): 

We have found so far that the concept of probability used in statistical
science is meaningless in its own terms; but we have found also that, however
meaningful it might have been, its meaningfulness would nevertheless have
remained fruitless because of the impossibility of gaining information from
experimental results, however significant. This final paradox, in some ways the
most beautiful, I shall call the Experimental Paradox (p. 66).

The essence of randomness has been taken to be absence of pattern. But
what has not hitherto been faced is that the absence of one pattern logically demands
the presence of another. It is a mathematical contradiction to say that a series
has no pattern; the most we can say is that it has no pattern that anyone is
likely to look for. The concept of randomness bears meaning only in relation
to the observer: If two observers habitually look for different kinds of pattern
they are bound to disagree upon the series which they call random (p. 105).

In Section G.1 I carefully explain why I disagree with Spencer-Brown's analysis of
probability and statistics. In some of my arguments I dissent from Spencer-Brown's
interpretation of measures of order-disorder in sequential signals. These arguments are
based on information theory and the notion of entropy. Atkins (1984), Attneave (1959),
Dugdale (1996), Krippendorff (1986) and Tarasov (1988) review some of the basic concepts
in this area using only elementary mathematics. For more advanced works see Kapur
(1989), Rissanen (1989) and Wallace (2005). Several authors concur, at least in part,
with my opinion about Spencer-Brown's analysis of probability and statistics, see Flew
(1959), Falk and Konold (1997), Good (1958) and Mundle (1959).

I also disapprove of some of Spencer-Brown's proposed methodologies to detect "relevant"
event sequences, that is, his criteria to "mark distinct patterns" in empirical observations.
My objections have a lot in common with the standard caveats against ex
post facto "fishing expeditions" for interesting outcomes, or simple post hoc "sub-group
analysis" in experimental data banks. This kind of retroactive or retrospective data analysis
is considered a questionable statistical practice, and is pointed to as the culprit of many
misconceived studies, misleading arguments and mistaken conclusions. The literature of
statistical methodology for clinical trials has been particularly active in warning against
this kind of practice, see Tribble (2008) and Wang (2007) for two interesting papers addressing
this specific issue and published in high impact medicine journals less than a
year before I began writing this chapter. When consulting for pharmaceutical companies
or advising in the design of statistical experiments, I often find it useful to quote Conan 
Doyle's Sherlock Holmes, in The Adventure of Wisteria Lodge: 

Still, it is an error to argue in front of your data. You find yourself insensibly 
twisting them around to fit your theories. 

Finally, I am suspicious or skeptical about some of the intended applications of 
Spencer-Brown's research program, including the use of extrasensory empathic perception for 
coded message communication, exercises on object manipulation using paranormal powers, 
etc. Unable to reconcile his psychic research program with statistical science, 
Spencer-Brown had no regrets in disqualifying the latter, as he clearly stated in the prestigious 
scientific journal Nature, Spencer-Brown (1953b, p. 594-595): 

[On telepathy:] Taking the psychical research data (that is, the residuum when 
fraud and incompetence are excluded), I tried to show that these now threw 
more doubt upon existing pre-suppositions in the theory of probability than 
in the theory of communication. 

[On psychokinesis:] If such an 'agency' could thus 'upset' a process of 
randomizing, then all our conclusions drawn through the statistical tests of 
significance would be equally affected, including the conclusions about the 
'psychokinesis' experiments themselves. (How are the target numbers for the 
die throws to be randomly chosen? By more die throws?) To speak of an 
'agency' which can 'upset' any process of randomization in an uncontrollable 
manner is logically equivalent to speaking of an inadequacy in the theoretical 
model for empirical randomness, like the luminiferous ether of an earlier 
controversy, becomes, with the obsolescence of the calculus in which it occurs, a 
superfluous term. 

Spencer-Brown's (1953, 1957) conclusions, including his analysis of probability, were 
considered to be controversial (if not unreasonable or extravagant) even by his own 
colleagues at the Society for Psychical Research, see Scott (1958), and Soal (1953). It seems 
that current research in this area, even if not free (or afraid) of criticism, has abandoned 
the path of naive confrontation with statistical science, see Atmanspacher (2005) and 
Ehm (2005). For additional comments, see Henning (2006), Kaptchuk and Kerr (2004), 
Utts (1991), and Wassermann (1955). 

Curiously, Charles Sanders Peirce and his student Joseph Jastrow, who introduced 
the idea of randomization in statistical trials, struggled with some of the very same 
dilemmas faced by Spencer-Brown, namely, the eventual detection of distinct patterns or 
seemingly ordered (sub)strings in a long random sequence. Peirce and Jastrow did not have at 
their disposal the heavy mathematical artillery I have cited in the previous paragraphs. 
Nevertheless, like experienced explorers that when traveling in the desert are not lured by 
the mirage of a misplaced oasis, these intrepid pioneers were able to avoid the conceptual 
pitfalls that led Spencer-Brown so far astray. For more details see Bonassi et al. (2008), 
Dehue (1997), Hacking (1988), and Peirce and Jastrow (1885). 

As stated in the introduction, the Cog-Con framework is supported by the FBST, a 
formalism based on a non-decision theoretic form of Bayesian statistics. The FBST was 
conceived as a tool for validating objective knowledge and, in this role, it can be easily 
integrated into the Cog-Con epistemological framework in the practice of scientific research. 
Contrasting our distinct views of cognitive constructivism, it is not at all surprising that 
I have come to conclusions concerning the use of probability and statistics, and also 
the relation between probability and logic, that are fundamentally different from those of 
Spencer-Brown. 

6.8.2 Overcoming Dilemmas and Conceptual Dichotomies 

As stated by William James, our ways of understanding require us to split reality with 
conceptual distinctions. The non-trivial consequences of the resulting dichotomies are 
captured, almost poetically, by James (1909, Lecture VI) in the following passage from A 
Pluralistic Universe: 

The essence of life is its continuously changing character; but our concepts 
are all discontinuous and fixed, and the only mode of making them coincide 
with life is by arbitrarily supposing positions of arrest therein. With such 
arrests our concepts may be made congruent. But these concepts are not 
parts of reality, not real positions taken by it, but suppositions rather, notes 
taken by ourselves, and you can no more dip up the substance of reality with 
them than you can dip up water with a net, however finely meshed. 

When we conceptualize, we cut out and fix, and exclude everything but 
what we have fixed. A concept means a that-and-no-other. Conceptually, 
time excludes space; motion and rest exclude each other; approach excludes 
contact; presence excludes absence; unity excludes plurality; independence 
excludes relativity; 'mine' excludes 'yours'; this connection excludes that con- 
nection - and so on indefinitely; whereas in the real concrete sensible flux of 
life experiences compenetrate each other so that it is not easy to know just 
what is excluded and what not... 

The conception of the first half of the interval between Achilles and the tor- 
toise excludes that of the last half, and the mathematical necessity of travers- 
ing it separately before the last half is traversed stands permanently in the 
way of the last half ever being traversed. Meanwhile the living Achilles... asks 
no leave of logic. 

Sure enough, our way of understanding requires us to make those conceptual distinc- 
tions that are most adequate (or adequate enough) for a given reality domain. However, 
the concepts that are appropriate to analyze reality at a given level, scale or granularity, 
may not be adequate at the next level, that may be lower or higher, larger or smaller, 
coarser or finer. How then can we avoid being trapped by such distinctions? How can 
we overcome the distinctions made at one level in order to be able to reach the next, and 
still maintain a coherent or congruent view of the universe? 

The Cog-Con endeavor requires languages and mechanisms to overcome the limitations 
of conceptual distinctions and, at the same time, enable us to coherently build new 
concepts that can be used at the next or new domains. Of course, as in all scientific 
research, the goal of the new conceptual constructs is to entail theories and hypotheses 
providing objective knowledge (in its proper domain), and the success of the new theories 
must be judged pragmatically according to this goal. I claim that statistical models and 
their corresponding probabilistic mechanisms have been, in the history of modern science, 
among the most successful tools for accomplishing the task at hand. In Chapter 5, for 
example, we have shown in some detail how probabilistic reasoning can be used: 

- In quantum mechanics, using the language of Fourier series and transforms, to over- 
come the dilemmas posed by a physical theory using concepts and laws coming from two 
distinct and seemingly incompatible categories: The mechanics of discrete particles and 
wave propagation in continuous media or fields. 

- In stochastic optimization, using the language of inhomogeneous Markov chains, to 
overcome the dilemmas generated by dynamic populations of individuals with the need 
of reliable reproduction, hierarchical organization, and stable building blocks versus the 
need of creative evolution with innovative change or mutation. 

In an empirical science, from a pragmatic perspective, probabilistic reasoning seems 
to be an efficient tool for overcoming artificial dichotomies, allowing us to bridge the gaps 
created by our own conceptual distinctions. Such probabilistic models have been able to 
generate new eigen-solutions with very good characteristics, that is, eigen-solutions that 
are very objective (precise, stable, separable and composable). These new objects can then 
be used as stepping stones or building blocks for the construction of new, higher order 
theories. In this context, we thus assign, coherently with the Cog-Con epistemological 
framework, a high ontological status to probabilistic concepts and causation mechanisms, 
that is, we use a notion of probability that has a distinctively objective character. 

6.9 Final Remarks and Future Research 

The objective of this chapter was to use the Cog-Con framework for the understanding 
of massively complex and non-trivial systems. We have analyzed several forms of system 
complexity, several ways in which systems become non-trivial, and some interesting 
consequences, side effects and paradoxes generated by such non-triviality. What can we call 
the massive non-triviality found in nature? I call it The Living and Intelligent Universe. 
I could also call it Deus sive natura or, according to Einstein, 

Spinoza's God, a God who reveals himself in the orderly harmony of what 
exists. . . 

In future research we would like to extend the use of the same Cog-Con framework 
to the analysis of the ethical conduct of agents that are conscious and (to some degree) 
self-aware. The definition of ethics given by Russell (1999, p. 67), reads: 

The problem of Ethics is to produce a harmony and self-consistency in conduct, 
but mere self-consistency within the limits of the individual might be attained in 
many ways. There must therefore, to make the solution definite, be a universal 
harmony; my conduct must bring satisfaction not merely to myself, but to all 
whom it affects, so far as that is possible. 

Hence, in this setting, such a research program should be concerned with the understanding 
and evaluation of choices and decisions made by agents, acting in a system to which 
they belong. Such an analysis should provide criteria for addressing the coherence and 
consistency of the behavior of such agents, including the direct, indirect and reflexive 
consequences of their actions. Moreover, since we consider conscious agents, their values, 
beliefs and ideas should also be included in the proposed models. The importance of 
pursuing this line of research, and also the inherent difficulties of this task, are summarized 
by Eigen (1992, p. 126): 

But long and difficult will be our ascent from the lowest landing up to the 
topmost level of life, the level of self-awareness: our continued ascent from 
man to humanity. 

Goertzel (2008) points to generalizations of standard probabilistic and logical for- 
malisms, and urges us to explore further connections between them, see for example 
Borges and Stern (2007), Caticha (2008), Costa (1986, 1993), Jaynes (1990), Stern (2004) 
and Youssef (1994, 1995). I am fully convinced that this path of cross fertilization between 
probability and logic is another important field for future research. 

Epilog 



In six chapters and ten appendices, we have presented our case in defense of a 
constructivist epistemological framework and the use of compatible statistical theory and inference 
tools. In these final remarks, we shall try to wrap up, as concisely as possible, the reasons 
for adopting the constructivist world-view. 

The basic metaphor of decision theory is the maximization of a gambler's expected 
fortune, according to his own subjective utility, prior beliefs and learned experiences. This 
metaphor has proven to be very useful, leading the development of Bayesian statistics 
since its 20th-century revival, rooted in the work of de Finetti, Savage and others. 

The basic metaphor presented in this text, as a foundation for cognitive constructivism, 
is that of an eigen-solution, and the verification of its objective epistemic status. The 
FBST is the cornerstone of a set of statistical tools conceived to assess the epistemic value 
of such eigen-solutions, according to their four essential attributes, namely, sharpness, 
stability, separability and composability. We believe that this alternative perspective, 
complementary to the one offered by decision theory, can provide powerful insights and 
make pertinent contributions in the context of scientific research. 

To fulfill our promise of concision, we finish here this summer course / tutorial. We 
sincerely thank the readers for their attention and welcome their constructive comments. 
May the blessings of the three holy knights in Figure J.2-4 protect and guide you on your 
way. Farewell and goodbye! 

"E aquela era a hora do mais tarde. 
O ceu vem abaixando. Narrei ao senhor. 
No que narrei, o senhor talvez ate ache, 
mais do que eu, a minha verdade. 

Fim que foi." 

And it was already the time of later on, 
the time of sun-down. My story I have told, 
my lord, so that you may find, perhaps even 
better than me, the truth I wanted to tell. 

The End (that already was). 

"Vivendo, se aprende; mas a que se aprende, 
mais, e so a fazer outras maiores perguntas. " 

Living one learns, but what one learns, 
is only how to ask even bigger questions. 

Joao Guimaraes Rosa (1908-1967). 

Grande Sertao: Veredas. 

Appendix A 
FBST Review 



"(A) man's logical method should be loved and reverenced as 
his bride, whom he has chosen from all the world. He need not 
contemn the others; on the contrary, he may honor them deeply, 
and in doing so he honors her more. But she is the one that he 
has chosen, and he knows that he was right in making that choice." 

C.S.Peirce (1839 - 1914), 
The Fixation of Belief (1877). 

"Make everything as simple as possible, but not simpler." 

Albert Einstein (1879 - 1955). 



A.1 Introduction 

The FBST was specially designed to give a measure of the epistemic value of a sharp 
statistical hypothesis H, given the observations, that is, to give a measure of the value 
of evidence in support of H given by the observations. This measure is given by the 
support function ev(H), the FBST e-value. Furthermore the e-value has many necessary 
or desirable properties for a statistical support function, such as: 

(I) Give an intuitive and simple measure of significance for the hypothesis in test, 
ideally, a probability defined directly in the original or natural parameter space. 

(II) Have an intrinsically geometric definition, independent of any non-geometric as- 
pect, like the particular parameterization of the (manifold representing the) null hypoth- 
esis being tested, or the particular coordinate system chosen for the parameter space, i.e., 
be an invariant procedure. 

(III) Give a measure of significance that is smooth, i.e. continuous and differentiable, 
on the hypothesis parameters and sample statistics, under appropriate regularity 
conditions for the model. 

(IV) Obey the likelihood principle, i.e., the information gathered from observations 
should be represented by, and only by, the likelihood function, see Berger and Wolpert 
(1988), Pawitan (2001, ch.7) and Wechsler et al. (2008). 

(V) Require no ad hoc artifice like assigning a positive prior probability to zero measure 
sets, or setting an arbitrary initial belief ratio between hypotheses. 

(VI) Be a possibilistic support function, where the support of a logical disjunction is 
the maximum support among the support of the disjuncts. 

(VII) Be able to provide a consistent test for a given sharp hypothesis. 

(VIII) Be able to provide compositionality operations in complex models. 

(IX) Be an exact procedure, i.e., make no use of "large sample" asymptotic approxi- 
mations when computing the e-value. 

(X) Allow the incorporation of previous experience or expert's opinion via (subjective) 
prior distributions. 

The objective of this section is to provide a very short review of the FBST theoretical 
framework, summarizing the most important statistical properties of its support function, 
the e-value. It also summarizes the logical (algebraic) properties of the e-value, and 
its relations to other classical support calculi, including possibilistic calculus and logic, 
paraconsistent and classical. Further details, demonstrations of theoretical properties, 
comparison with other statistical tests for sharp hypotheses, and an extensive list of 
references can be found in the author's previous papers. 

A.2 Bayesian Statistical Models 

A standard model of (parametric) Bayesian statistics concerns an observed (vector) 
random variable, $x$, that has a sampling distribution with a specified functional form, 
$p(x \mid \theta)$, indexed by the (vector) parameter $\theta$. This same functional form, regarded as a 
function of the free variable $\theta$ with a fixed argument $x$, is the model's likelihood function. 
In frequentist or classical statistics, one is allowed to use probability calculus in the sample 
space, but strictly forbidden to do so in the parameter space, that is, $x$ is to be considered as 
a random variable, while $\theta$ is not to be regarded as random in any way. In frequentist 
statistics, $\theta$ should be taken as a 'fixed but unknown quantity' (whatever that means). 

In the Bayesian context, the parameter $\theta$ is regarded as a latent (non-observed) random 
variable. Hence, the same formalism used to express credibility or (un)certainty, namely, 
probability theory, is used in both the sample and the parameter space. Accordingly, the 
joint probability distribution, $p(x, \theta)$, should summarize all the information available in a 
statistical model. Following the rules of probability calculus, the model's joint distribution 
of $x$ and $\theta$ can be factorized either as the likelihood function of the parameter given the 
observation times the prior distribution on $\theta$, or as the posterior density of the parameter 
times the observation's marginal density, 

$$ p(x, \theta) = p(x \mid \theta) \, p(\theta) = p(\theta \mid x) \, p(x) . $$

The prior probability distribution $p_0(\theta)$ represents the initial information available 
about the parameter. In this setting, a predictive distribution for the observed random 
variable is represented by a mixture (or superposition) of stochastic processes, all of 
them with the functional form of the sampling distribution, according to the prior mixing 
(or weights) distribution, 

$$ p(x) = \int_{\Theta} p(x \mid \theta) \, p_0(\theta) \, d\theta . $$

If we now observe a single event, $x$, it follows from the factorizations of the joint 
distribution above that the posterior probability distribution of $\theta$, representing the available 
information about the parameter after the observation, is given by 

$$ p_1(\theta) \propto p(x \mid \theta) \, p_0(\theta) . $$

In order to replace the 'proportional to' symbol, $\propto$, by an equality, it is necessary to 
divide the right hand side by the normalization constant, $c_1 = \int_{\Theta} p(x \mid \theta) \, p_0(\theta) \, d\theta$. This is 
the Bayes rule, giving the (inverse) probability of the parameter given the data. That is 
the basic learning mechanism of Bayesian statistics. Computing normalization constants 
is often difficult or cumbersome. Hence, especially in large models, it is customary to 
work with unnormalized densities or potentials as long as possible in the intermediate 
calculations, computing only the final normalization constants. It is interesting to observe 
that the joint distribution function, taken with fixed $x$ and free argument $\theta$, is a potential 
for the posterior distribution. 

Bayesian learning is a recursive process, where the posterior distribution after a 
learning step becomes the prior distribution for the next step. Assuming that the observations 
are i.i.d. (independent and identically distributed) the posterior distribution after $n$ 
observations, $x^{(1)}, \ldots, x^{(n)}$, becomes, 

$$ p_n(\theta) \propto p(x^{(n)} \mid \theta) \, p_{n-1}(\theta) \propto \prod_{i=1}^{n} p(x^{(i)} \mid \theta) \, p_0(\theta) . $$

If possible, it is very convenient to use a conjugate prior, that is, a mixing distribution 
whose functional form is invariant by the Bayes operation in the statistical model at hand. 
For example, the conjugate priors for the Normal and Multinomial models are, 
respectively, the Wishart and the Dirichlet distributions. The explicit form of these distributions is 
given in the next sections. 

The 'beginnings and the endings' of the Bayesian learning process really need further 
discussion, that is, we should present some rationale for choosing the prior distribution 
used to start the learning process, and some convergence theorems for the posterior as 
the number of observations increases. In order to do so, we must assess and measure the 
information content of a (posterior) distribution. Appendix E is dedicated to the concept 
of entropy, the key that unlocks many of the mysteries related to the problems at hand. In 
particular, Sections E.5 and E.6 discuss some fine details about criteria for prior selection 
and posterior convergence properties. 
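
The recursive learning scheme and the use of a conjugate prior described in this section can be illustrated with a minimal Python sketch. It assumes a multinomial model with a Dirichlet prior, represented only by its vector of (pseudo-)counts; the function names and the sample counts are hypothetical and chosen for illustration only. The sketch suggests that updating batch by batch, with each posterior becoming the next prior, gives the same result as a single update with the pooled counts.

```python
from functools import reduce

def dirichlet_update(alpha, counts):
    """One Bayes step: Dirichlet(alpha) prior plus multinomial counts
    gives a Dirichlet posterior with parameters alpha + counts."""
    return [a + c for a, c in zip(alpha, counts)]

def posterior_mean(alpha):
    """Mean of a Dirichlet distribution with parameter vector alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]

# Hypothetical data: three batches of trinomial counts.
prior = [1.0, 1.0, 1.0]          # uniform prior, as pseudo-counts
batches = [[3, 5, 2], [4, 4, 2], [1, 6, 3]]

# Recursive learning: the posterior of each step is the prior of the next.
sequential = reduce(dirichlet_update, batches, prior)

# Single update with the pooled observation counts.
pooled = [sum(column) for column in zip(*batches)]
batch = dirichlet_update(prior, pooled)

print(sequential, batch, posterior_mean(sequential))
```

Working with the count vectors alone corresponds to working with unnormalized densities: the normalization constant is never needed for the update itself.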



A.3 The Epistemic e-values 

Let $\theta \in \Theta \subseteq R^p$ be a vector parameter of interest, and $p(x \mid \theta)$ be the likelihood associated 
to the observed data as in the standard statistical model. Under the Bayesian paradigm 
the posterior density, $p_n(\theta)$, is proportional to the product of the likelihood and a prior 
density, 

$$ p_n(\theta) \propto p(x \mid \theta) \, p_0(\theta) . $$

The (null) hypothesis $H$ states that the parameter lies in the null set, defined by 
inequality and equality constraints given by vector functions $g$ and $h$ in the parameter 
space. 

$$ \Theta_H = \{ \theta \in \Theta \mid g(\theta) \le 0 \wedge h(\theta) = 0 \} $$

From now on, we use a relaxed notation, writing $H$ instead of $\Theta_H$. We are particularly 
interested in sharp (precise) hypotheses, i.e., those in which there is at least one equality 
constraint and hence, $\dim(H) < \dim(\Theta)$. 

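The FBST defines $\mathrm{ev}(H)$, the e-value supporting (in favor of) the hypothesis $H$, and 
$\overline{\mathrm{ev}}(H)$, the e-value against $H$, as 

$$ s(\theta) = \frac{p_n(\theta)}{r(\theta)} , \quad s^* = s(\theta^*) = \sup_{\theta \in H} s(\theta) , \quad \hat{s} = s(\hat{\theta}) = \sup_{\theta \in \Theta} s(\theta) , $$

$$ T(v) = \{ \theta \in \Theta \mid s(\theta) \le v \} , \quad W(v) = \int_{T(v)} p_n(\theta) \, d\theta , \quad \mathrm{ev}(H) = W(s^*) , $$

$$ \overline{T}(v) = \Theta - T(v) , \quad \overline{W}(v) = 1 - W(v) , \quad \overline{\mathrm{ev}}(H) = \overline{W}(s^*) = 1 - \mathrm{ev}(H) . $$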

The function $s(\theta)$ is known as the posterior surprise relative to a given reference 
density, $r(\theta)$. $W(v)$ is the cumulative surprise distribution. The surprise function was 
used, among other statisticians, by Good [23], Evans [16] and Royall [48]. Its role in 
the FBST is to make $\mathrm{ev}(H)$ explicitly invariant under suitable transformations on the 
coordinate system of the parameter space, see next section. 

The tangential (to the hypothesis) set $\overline{T} = \overline{T}(s^*)$ is a Highest Relative Surprise Set 
(HRSS). It contains the points of the parameter space with higher surprise, relative to 
the reference density, than any point in the null set $H$. When $r(\theta) \propto 1$, the possibly 
improper uniform density, $\overline{T}$ is the Posterior's Highest Density Probability Set (HDPS) 
tangential to the null set $H$. Small values of $\overline{\mathrm{ev}}(H)$ indicate that the hypothesis traverses 
high density regions, favoring the hypothesis. 

Notice that, in the FBST definition, there is an optimization step and an integration 
step. The optimization step follows a typical maximum probability argument, according to 
which "a system is best represented by its highest probability realization". The 
integration step extracts information from the system as a probability weighted average. Many 
inference procedures of classical statistics rely basically on maximization operations, while 
many inference procedures of Bayesian statistics rely on integration (or marginalization) 
operations. In order to achieve all its desired properties, the FBST procedure has to use 
both, as explained in this appendix. 

The evidence value, defined above, has a simple and intuitive geometric 
characterization. We now illustrate the above definitions with two simple but non-trivial examples. 
These two examples are easy to visualize, since they have a two dimensional parameter 
space, and are also non-trivial, in the sense that they have a non-linear hypothesis. 

Coefficient of Variation 

The Coefficient of Variation (CV) of a random variable $X$ is defined as the ratio 
$CV(X) = \sigma(X)/E(X)$, i.e. the ratio of its standard deviation to its mean. Let $X$ be a normal 
random variable, with unknown mean and variance. We want to compute the evidence 
value supporting the hypothesis that the coefficient of variation of $X$ is equal to a given 
constant, 

$$ X \sim N(\beta, \sigma) , \quad H: \sigma / \beta = c . $$

The conjugate family for this problem is the family of bivariate distributions, where 
the conditional distribution of the mean $\beta$, for a fixed precision $\rho = 1/\sigma^2$, is normal, 
and the marginal distribution of the precision $\rho$ is gamma, DeGroot (1970). Using the 
standard improper priors, uniform on $]-\infty, +\infty[$ for $\beta$, and $1/\rho$ on $]0, +\infty[$ for $\rho$, we get 
the posterior joint distribution for $\beta$ and $\rho$: 

$$ p_n(\beta, \rho \mid x) \propto \sqrt{\rho} \, \exp\left( -n \rho (\beta - \bar{x})^2 / 2 \right) \; \rho^{(n-3)/2} \exp\left( -\rho s_n / 2 \right) , $$

where $\bar{x}$ is the sample mean and $s_n = \sum_i (x_i - \bar{x})^2$ is the sum of squared deviations from it. 

Figure A.1: FBST for H: CV=0.1 (three panels, each with n=16, m=10, c=0.1: std=1.0, evid=0.93; std=1.1, evid=0.67; std=1.5, evid=0.01; horizontal axis: posterior mean) 

Figure A.1 shows the null set $H$, the tangential HRSS $\overline{T}$, and the points of constrained 
and unconstrained maxima, $\theta^*$ and $\hat{\theta}$, for testing the hypothesis at hand with the following 
numerical example: $CV = 0.1$ with 3 samples of size $n = 16$, mean $\bar{x} = 10$ and standard 
deviations std = 1.0, std = 1.1 and std = 1.5. We can see the tangent set expanding as 
the sample standard deviation over mean ratio gets farther away from the coefficient of 
variation being tested, $CV(X) = \sigma(X)/E(X) = 0.1$. In this example we use the standard 
improper prior density and the uniform reference density. In the first plot, the sample 
standard deviation over mean ratio equals the coefficient of variation tested. Nevertheless, 
the evidence against the null hypothesis is not zero; this is because of the non uniform 
prior. In order to test other hypotheses we only have to change the constraint(s) passed 
to the optimizer. Constraints for the hypotheses $\beta = c$ and $\sigma = c$ would be represented 
by, respectively, vertical and horizontal lines. All the details for these and other simple 
examples, as well as comparisons with standard frequentist and Bayesian tests, can be 
found in Irony et al. (2001), Pereira and Stern (1999b, 2000a,b) and Pereira and Wechsler 
(1993). 
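
The optimization and integration steps of this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it takes the log-posterior to be $\frac{n-2}{2} \log \rho - \frac{\rho}{2} (s_n + n(\beta - \bar{x})^2)$ (the improper priors quoted in the text), a uniform reference density in the $(\beta, \rho)$ coordinates, and $s_n = n \, \mathrm{std}^2$; the function names, the grid search and the Monte Carlo sample size are hypothetical choices.

```python
import math
import random

def log_post(beta, rho, n, xbar, sn):
    """Log of the unnormalized posterior of (beta, rho) under the assumptions above."""
    return 0.5 * (n - 2) * math.log(rho) - 0.5 * rho * (sn + n * (beta - xbar) ** 2)

def cv_evalue(n, xbar, std, c, n_samples=20000, seed=0):
    """Monte Carlo estimate of ev(H) for H: sigma / beta = c (assumes xbar > 0)."""
    sn = n * std ** 2  # assumption: std is the 1/n sample standard deviation

    # Optimization step: on H, sigma = c * beta, hence rho = 1 / (c * beta)^2.
    # A simple grid search over beta > 0 stands in for a proper optimizer.
    betas = [xbar * (0.2 + 0.001 * i) for i in range(3000)]
    log_s_star = max(log_post(b, 1.0 / (c * b) ** 2, n, xbar, sn) for b in betas)

    # Integration step: sample the posterior and count the points whose
    # surprise does not exceed the constrained maximum s*.
    # rho ~ Gamma(shape (n-1)/2, rate sn/2), beta | rho ~ Normal(xbar, 1/(n rho)).
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        rho = rng.gammavariate(0.5 * (n - 1), 2.0 / sn)  # second argument is a scale
        beta = rng.gauss(xbar, 1.0 / math.sqrt(n * rho))
        if log_post(beta, rho, n, xbar, sn) <= log_s_star:
            hits += 1
    return hits / n_samples

for std in (1.0, 1.1, 1.5):
    print(std, cv_evalue(16, 10.0, std, 0.1))
```

The estimates can be compared with the evidence values quoted for the three panels of Figure A.1; any discrepancy may come from the assumptions listed above or from Monte Carlo error. Testing a different hypothesis only requires changing the parameterization of the constraint used in the optimization step.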

Hardy-Weinberg equilibrium 

Figure A.2 shows the null set $H$, the tangential HRSS $\overline{T}$, and the points of constrained 
and unconstrained maxima, $\theta^*$ and $\hat{\theta}$, for testing Hardy-Weinberg equilibrium law in a 
population genetics problem, as discussed in Pereira and Stern (1999). In this biological 
application $n$ is the sample size, $x_1$ and $x_3$ are the two homozygote sample counts and 
$x_2 = n - x_1 - x_3$ is the heterozygote sample count. $\theta = [\theta_1, \theta_2, \theta_3]$ is the parameter vector. 
The posterior and maximum entropy reference densities for this trinomial model, the 
parameter space and the null set are: 

$$ p_n(\theta \mid x) \propto \theta_1^{x_1 + y_1 - 1} \, \theta_2^{x_2 + y_2 - 1} \, \theta_3^{x_3 + y_3 - 1} , \quad r(\theta) \propto \theta_1^{y_1 - 1} \, \theta_2^{y_2 - 1} \, \theta_3^{y_3 - 1} , $$

$$ \Theta = \{ \theta \ge 0 \mid \theta_1 + \theta_2 + \theta_3 = 1 \} , \quad H = \{ \theta \in \Theta \mid \theta_3 = (1 - \sqrt{\theta_1})^2 \} . $$

Figure A.2: H-W Hypothesis and Tangential Set 
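
For the Hardy-Weinberg example, the two steps of the FBST can be sketched as follows. This is a minimal illustration, not the procedure of the cited papers: it assumes a uniform prior (pseudo-counts $y = [1, 1, 1]$) and, for simplicity, a uniform reference density, so that the surprise is proportional to $\theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3}$. On the null set one can write $\theta = [p^2, 2p(1-p), (1-p)^2]$, and under these assumptions the constrained maximizer is $p = (2 x_1 + x_2) / (2n)$. The function name and the sample counts are hypothetical.

```python
import math
import random

def hw_evalue(x, n_samples=20000, seed=0):
    """Monte Carlo estimate of ev(H) for Hardy-Weinberg equilibrium.

    x = [x1, x2, x3]: homozygote, heterozygote and homozygote counts.
    Assumes a uniform prior and a uniform reference density.
    """
    x1, x2, x3 = x
    n = x1 + x2 + x3

    def log_surprise(theta):
        # log s(theta), up to an additive constant
        return sum(xi * math.log(ti) for xi, ti in zip(x, theta) if xi > 0)

    # Optimization step: closed-form maximizer of the surprise on H.
    p = (2 * x1 + x2) / (2 * n)
    theta_star = [p * p, 2 * p * (1 - p), (1 - p) ** 2]
    log_s_star = log_surprise(theta_star)

    # Integration step: sample theta ~ Dirichlet(x + 1) through normalized
    # gamma variates and count the points with s(theta) <= s*.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        g = [rng.gammavariate(xi + 1, 1.0) for xi in x]
        total = sum(g)
        theta = [gi / total for gi in g]
        if log_surprise(theta) <= log_s_star + 1e-12:
            hits += 1
    return hits / n_samples

print(hw_evalue([25, 50, 25]))  # counts exactly in H-W proportions
print(hw_evalue([40, 20, 40]))  # strong heterozygote deficit
```

When the observed proportions lie exactly on the null set, the constrained and unconstrained maxima coincide and the tangential set is empty, so the estimated evidence in favor of the hypothesis should be essentially 1; counts far from equilibrium should give a value close to 0.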

Nuisance Parameters 

Let us consider the situation where the hypothesis constraint, $H: h(\theta) = h(\delta) = 0$, $\theta = 
[\delta, \lambda]$, is not a function of some of the parameters, $\lambda$. This situation is described by D. Basu 
in Ghosh (1988): 

"If the inference problem at hand relates only to $\delta$, and if information 
gained on $\lambda$ is of no direct relevance to the problem, then we classify $\lambda$ as the 
Nuisance Parameter. The big question in statistics is: How can we eliminate 
the nuisance parameter from the argument?" 

Basu goes on listing at least 10 categories of procedures to achieve this goal, like using 
$\max_\lambda$ or $\int d\lambda$, the maximization or integration operators, in order to obtain a projected 
profile or marginal posterior function, $p(\delta \mid x)$. The FBST does not follow the nuisance 
parameters elimination paradigm, working in the original parameter space, in its full 
dimension. 

A.4 Reference, Invariance and Consistency 

In the FBST the role of the reference density, $r(\theta)$, is to make $\overline{\mathrm{ev}}(H)$ explicitly invariant 
under suitable transformations of the coordinate system. The natural choice of reference 
density is an uninformative prior, interpreted as a representation of no information in 
the parameter space, or the limit prior for no observations, or the neutral ground state 
for the Bayesian operation. Standard (possibly improper) uninformative priors include 
the uniform and maximum entropy densities, see Dugdale (1996) and Kapur (1989) for a 
detailed discussion. Invariance, as used in statistics, is a metric concept. The reference 
density can be interpreted as induced by the information metric in the parameter space, 
$dl^2 = d\theta' G(\theta) \, d\theta$. Jeffreys' invariant prior is given by $p(\theta) = \sqrt{\det G(\theta)}$, see Section E.5. 

In the H-W example, using the notation above, the uniform density can be represented 
by $y = [1, 1, 1]$ observation counts, and the standard maximum entropy density can be 
represented by $y = [0, 0, 0]$ observation counts. 

Let us consider the cumulative distribution of the evidence value against the 
hypothesis, $\overline{V}(c) = \Pr(\overline{\mathrm{ev}} \le c)$, given $\theta^0$, the true value of the parameter. Under appropriate 
regularity conditions, for increasing sample size, $n \to \infty$, we can say the following: 

- If $H$ is false, $\theta^0 \notin H$, then $\overline{\mathrm{ev}}$ converges (in probability) to 1, that is, 
$\overline{V}(0 \le c < 1) \to 0$. 

- If $H$ is true, $\theta^0 \in H$, then $\overline{V}(c)$, the confidence level, is approximated by the function 

$$ QQ(t, h, c) = Q\left( t - h, Q^{-1}(t, c) \right) , \quad \text{where} $$

$$ Q(k, x) = \frac{\Gamma(k/2, x/2)}{\Gamma(k/2, \infty)} , \quad \Gamma(k, x) = \int_0^x y^{k-1} e^{-y} \, dy , $$

$t = \dim(\Theta)$, $h = \dim(H)$ and $Q(k, x)$ is the cumulative chi-square distribution with $k$ 
degrees of freedom. Figure A.3 portrays $QQ(t, h, c) = Q(t - h, Q^{-1}(t, c))$ for $t = 2 \ldots 4$ and 
$h = 0 \ldots t-1$. 
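
The asymptotic approximation $QQ(t, h, c)$ is easy to evaluate numerically. The following Python sketch uses only the standard library: it computes the cumulative chi-square distribution $Q(k, x)$ through the power series of the regularized lower incomplete gamma function and inverts it by bisection. The function names, tolerances and iteration counts are assumptions made for this illustration; a statistical library would normally be used instead.

```python
import math

def chi2_cdf(k, x):
    """Q(k, x): cumulative chi-square distribution with k degrees of freedom,
    computed as the regularized lower incomplete gamma function P(k/2, x/2)."""
    if x <= 0.0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    # Series: P(a, z) = z^a e^{-z} * sum_{n >= 0} z^n / Gamma(a + n + 1)
    term = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
    total = term
    n = 0
    while term > 1e-16 * total and n < 10000:
        n += 1
        term *= z / (a + n)
        total += term
    return min(total, 1.0)

def chi2_inv(k, c):
    """Q^{-1}(k, c), obtained by bisection, for 0 <= c < 1."""
    if not 0.0 <= c < 1.0:
        raise ValueError("c must satisfy 0 <= c < 1")
    lo, hi = 0.0, 1.0
    while chi2_cdf(k, hi) < c:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(k, mid) < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def qq(t, h, c):
    """QQ(t, h, c) = Q(t - h, Q^{-1}(t, c)), for h = 0 ... t - 1."""
    return chi2_cdf(t - h, chi2_inv(t, c))

for t in (2, 3, 4):
    print(t, [round(qq(t, h, 0.5), 4) for h in range(t)])
```

For $h = 0$ the function reduces to the identity, $QQ(t, 0, c) = c$, which provides a simple sanity check of the implementation.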

Under the same regularity conditions, an appropriate choice of threshold or critical 
level, $c(n)$, provides a consistent test, $\tau_c$, that rejects the hypothesis if $\overline{\mathrm{ev}}(H) > c$. The 
empirical power analysis developed in Stern and Zacks (2002) and Lauretto et al. (2003), 
provides critical levels that are consistent and also effective for small samples. 




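Figure A.3: Test $\tau_c$ critical level vs. confidence level 

Proof of invariance: 

Consider a proper (bijective, integrable, and almost surely continuously differentiable) 
reparameterization $\omega = \phi(\theta)$. Under the reparameterization, the Jacobian, surprise, 
posterior and reference functions are: 

$$ J(\omega) = \left[ \frac{\partial \theta}{\partial \omega} \right] = \left[ \frac{\partial \phi^{-1}(\omega)}{\partial \omega} \right] = \begin{bmatrix} \frac{\partial \theta_1}{\partial \omega_1} & \cdots & \frac{\partial \theta_1}{\partial \omega_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial \theta_n}{\partial \omega_1} & \cdots & \frac{\partial \theta_n}{\partial \omega_n} \end{bmatrix} , $$

$$ \tilde{s}(\omega) = \frac{\tilde{p}_n(\omega)}{\tilde{r}(\omega)} = \frac{p_n(\phi^{-1}(\omega)) \, |J(\omega)|}{r(\phi^{-1}(\omega)) \, |J(\omega)|} . $$

Let $\Omega_H = \phi(\Theta_H)$. It follows that 

$$ \tilde{s}^* = \sup_{\omega \in \Omega_H} \tilde{s}(\omega) = \sup_{\theta \in \Theta_H} s(\theta) = s^* , $$

hence, the tangential set, $\overline{T} \mapsto \phi(\overline{T}) = \widetilde{\overline{T}}$, and 

$$ \widetilde{\overline{\mathrm{ev}}}(H) = \int_{\widetilde{\overline{T}}} \tilde{p}_n(\omega) \, d\omega = \int_{\overline{T}} p_n(\theta) \, d\theta = \overline{\mathrm{ev}}(H) . $$

Proof of consistency: 

Let $\overline{V}(c) = \Pr(\overline{\mathrm{ev}} \le c)$ be the cumulative distribution of the evidence value against 
the hypothesis, given $\theta^0$. We stated that, under appropriate regularity conditions, for 
increasing sample size, $n \to \infty$, if $H$ is true, i.e. $\theta^0 \in H$, then $\overline{V}(c)$ is approximated by 
the function 

$$ QQ(t, h, c) = Q\left( t - h, Q^{-1}(t, c) \right) . $$

Let $\theta^0$, $\hat{\theta}$ and $\theta^*$ be the true value, the unconstrained MAP (Maximum A Posteriori), 
and constrained (to $H$) MAP estimators of the parameter $\theta$. 

Since the FBST is invariant, we can choose a coordinate system where the (likelihood 
function) Fisher information matrix at the true parameter value is the identity, 
i.e., $J(\theta^0) = I$. From the posterior Normal approximation theorem, see Section 5 of 
Appendix E, we know that the standardized total difference between $\hat{\theta}$ and $\theta^0$ converges in 
distribution to a standard Normal distribution, i.e. 

$$ \sqrt{n} \, (\hat{\theta} - \theta^0) \to N\left( 0, J(\theta^0)^{-1} J(\theta^0) J(\theta^0)^{-1} \right) = N\left( 0, J(\theta^0)^{-1} \right) = N(0, I) . $$

This standardized total difference can be decomposed into tangent (to the hypothesis 
manifold) and transversal orthogonal components, i.e. 

$$ d_t = d_h + d_{t-h} , \quad d_t = \sqrt{n} \, (\hat{\theta} - \theta^0) , \quad d_h = \sqrt{n} \, (\theta^* - \theta^0) , \quad d_{t-h} = \sqrt{n} \, (\hat{\theta} - \theta^*) . $$

Hence, the total, tangent and transversal distances ($L^2$ norms), $\|d_t\|^2$, $\|d_h\|^2$ and 
$\|d_{t-h}\|^2$, converge in distribution to chi-square variates with, respectively, $t$, $h$ and $t - h$ 
degrees of freedom. 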

Also from, the MAP consistency, we know that the MAP estimate of the Fisher infor- 
mation matrix, J, converges in probability to true value, J{0^)- 

Now, if Xn converges in distribution to X, and Yn converges in probability to Y , we 
know that the pair [X„, y„] converges in distribution to [X, Y]. Hence, the pair [| |, J] 
converges in distribution to [x, J {9^)]^ where a; is a chi-square variate with t — h degrees 
of freedom. So, from the continuous mapping theorem, the evidence value against H, 
ev converges in distribution to e = Q(t, a;), where a; is a chi-square variate with t — h 
degrees of freedom. 

Since the cumulative chi-square distribution is an increasing function, we can invert
the last formula, i.e., $\overline{e} = Q(t,x) \leq c \Leftrightarrow x \leq Q^{-1}(t,c)$.
But, since $x$ is a chi-square variate with $t-h$ degrees of freedom,

$$ \Pr(\overline{e} \leq c) = Q\left(t-h, Q^{-1}(t,c)\right) = QQ(t,h,c) . \quad \mbox{Q.E.D.} $$

A similar argument, using a non-central chi-square distribution, proves the other
asymptotic statement.

If a random variable, $x$, has a continuous and increasing cumulative distribution
function, $F(x)$, the random variable $u = F(x)$ has uniform distribution. Hence, the
transformation $\overline{\mathrm{sev}} = QQ(t,h,\overline{\mathrm{ev}})$ defines a
"standardized e-value", $\mathrm{sev} = 1 - \overline{\mathrm{sev}}$, that can be used
somewhat in the same way as a p-value of classical statistics. This standardized e-value
may be a convenient form to report, since its asymptotically uniform distribution
provides a large-sample limit interpretation, and many researchers will already feel
familiar with consequent diagnostic procedures for scientific hypotheses based on
abundant empirical data-sets.



A.5 Loss Functions

In orthodox decision theoretic Bayesian statistics, a significance test is legitimate if
and only if it can be characterized as an Acceptance (A) or Rejection (R) decision
procedure defined by the minimization of the posterior expectation of a loss function,
$\Lambda$. Madruga (2001) gives the following family of loss functions characterizing the
FBST. This loss function is based on indicator functions of $\theta$ being or not in the
tangential set $T$:

$$ \Lambda(R,\theta) = a \, I(\theta \notin T) , \quad
   \Lambda(A,\theta) = b + d \, I(\theta \in T) . $$

The interpretation of this loss function is as follows: If $\theta \in T$ we want to
reject $H$, for $\theta$ is more probable than anywhere on $H$; if $\theta \notin T$ we
want to accept $H$, for $\theta$ is less probable
than anywhere on $H$. The minimization of this loss function gives the optimal test:

$$ \mbox{Accept } H \mbox{ iff } \mathrm{ev}(H) > \varphi = (b + d)/(a + d) . $$

Note that this loss function is dependent on the observed sample (via the likelihood
function), on the prior, and on the reference density, stressing the important point of
non-separability of utility and probability, see Kadane and Winkler (1987) and Rubin
(1987).

This type of loss function can be easily adapted in order to provide an asymptotic
indicator checking if the true parameter belongs to the hypothesis set,
$I(\theta^0 \in H)$. Consider the tangential reference mass,

$$ m = \int_T r^\gamma(\theta) \, d\theta . $$

If $\gamma = 1$, $m$ is the reference density mass of the tangential set. If
$\gamma = 0$, $m$ is a pseudo-distance from $\hat\theta$ to $\theta^*$. Consider also a
threshold of form $\varphi_1 = bm$ or $\varphi_2 = bm/(a + m)$, $a, b > 0$, in the
expression of the optimal test above.

If $\theta^0 \notin H$, then $\hat\theta \to \theta^0$ and $\theta^* \to \theta^{0*}$,
where $\theta^{0*} \neq \theta^0$, therefore $\|\hat\theta - \theta^*\| \to c_1 > 0$. But
the standardized posterior, $p_n$, converges to a normal distribution centered on
$\theta^0$. Hence, $m \to c_2 > 0$ and $\varphi \to c_3 > 0$. Finally, since
$\mathrm{ev}(H) \to 0$, $\Pr(\mathrm{ev}(H) > \varphi) \to 0$.

If $\theta^0 \in H$, then $\hat\theta \to \theta^0$ and $\theta^* \to \theta^0$, therefore
$\|\hat\theta - \theta^*\| \to 0$. Hence, $m \to 0$ and $\varphi \to 0$. But
$\mathrm{ev}(H)$ converges to a proper distribution, see section A.3, and, therefore,
$\Pr(\mathrm{ev}(H) > \varphi) \to 1$.

A.6 Belief Calculi and Support Structures

Many standard Belief Calculi can be formalized in the context of Abstract Belief
Calculus, ABC, see Darwiche and Ginsberg (1992), Darwiche (1993) and Stern (2003). In a
Support Structure, $\langle \Phi, \oplus, \oslash \rangle$, the first element is a Support
Function, $\Phi$, on a universe of statements, $U$. Null and full support values are
represented by $0$ and $1$. The second element is a support Summation operator,
$\oplus$, and the third is a support Scaling or Conditionalization operator, $\oslash$. A
Partial Support Structure, $\langle \Phi, \oplus \rangle$, lacks the scaling operation.

The Support Summation operator, $\oplus$, gives the support value of the disjunction of
any two logically disjoint statements from their individual support values, i.e.,

$$ \neg(A \wedge B) \Rightarrow \Phi(A \vee B) = \Phi(A) \oplus \Phi(B) . $$



The support scaling operator updates an old state of belief to the new state of belief
resulting from making an observation. Hence it can be interpreted as predicting or
propagating changes of belief after a possible observation. Formally, the support scaling
operator, $\oslash$, gives the conditional support value of $B$ given $A$ from the
unconditional support values of $A$ and the conjunction $C = A \wedge B$, i.e.,

$$ \Phi_A(B) = \Phi(A \wedge B) \oslash \Phi(A) . $$

The support unscaling operator reconstitutes the old state of belief from a new state
of belief and the observation that has led to it. Hence it can be interpreted as
explaining or back-propagating changes of belief for a given observation. If $\Phi$ does
not reject $A$, the support unscaling operator, $\otimes$, gives the inverse of the
scaling operator, i.e.,

$$ \Phi(A \wedge B) = \Phi_A(B) \otimes \Phi(A) . $$

Support structures for some standard belief calculi are given in Table A.1, where the
support values of two statements and of their conjunction are given by $a = \Phi(A)$,
$b = \Phi(B)$, $c = \Phi(C = A \wedge B)$. In Table A.1, the relation $a \preceq b$
indicates that the value $b$ represents a support at least as strong as the value $a$.
Darwiche and Ginsberg (1992) and Darwiche (1993) also give a set of axioms defining the
essential functional properties of a (partial) support function. Stern (2003) shows that
the support $\Phi(H) = \mathrm{ev}(H)$ complies with all these axioms.

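Table A.1: Support structures for some belief calculi,
$a = \Phi(A)$, $b = \Phi(B)$, $c = \Phi(C = A \wedge B)$.

Range of $\Phi$       $a \oplus b$    $0$        $1$   $a \preceq b$   $c \oslash a$    $a \otimes b$   Calculus
$[0,1]$               $a+b$           $0$        $1$   $a \leq b$      $c/a$            $a \times b$    Probability
$[0,1]$               $\max(a,b)$     $0$        $1$   $a \leq b$      $c/a$            $a \times b$    Possibility
$\{0,1\}$             $\max(a,b)$     $0$        $1$   $a \leq b$      $\min(c,a)$      $\min(a,b)$     Classic. Logic
$[0,1]$               $a+b-1$         $1$        $0$   $b \leq a$      $(c-a)/(1-a)$    $a+b-ab$        Improbability
$\{0 \ldots \infty\}$ $\min(a,b)$     $\infty$   $0$   $b \leq a$      $c-a$            $a+b$           Disbelief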
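
The rows of Table A.1 can be exercised with a small sketch. It is only an illustration:
the dictionary `CALCULI` and the helper `round_trip` are hypothetical names, and the
operators are transcribed from the table as reconstructed here. The sketch checks two
structural facts stated in the text: the null support value is the neutral element of
the summation, and unscaling inverts scaling.

```python
import math

# For each calculus: (summation, scaling, unscaling, null value, full value).
CALCULI = {
    "probability": (lambda a, b: a + b, lambda c, a: c / a,
                    lambda a, b: a * b, 0.0, 1.0),
    "possibility": (max, lambda c, a: c / a,
                    lambda a, b: a * b, 0.0, 1.0),
    "classical_logic": (max, min, min, 0, 1),
    "improbability": (lambda a, b: a + b - 1, lambda c, a: (c - a) / (1 - a),
                      lambda a, b: a + b - a * b, 1.0, 0.0),
    "disbelief": (min, lambda c, a: c - a,
                  lambda a, b: a + b, math.inf, 0.0),
}


def round_trip(name, c, a):
    """Scale c = Phi(A and B) by a = Phi(A), then unscale the result by a."""
    _, scale, unscale, _, _ = CALCULI[name]
    return unscale(scale(c, a), a)


def sum_with_null(name, a):
    """Support summation of a with the null support value of the calculus."""
    summation, _, _, null, _ = CALCULI[name]
    return summation(a, null)
```

For example, in the probability calculus `round_trip("probability", 0.2, 0.5)` first
computes the conditional support $0.2/0.5$ and then recovers the joint support by
multiplication.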



In the FBST, the support values, $\Phi(H) = \mathrm{ev}(H)$, are computed using standard
probability calculus on $\Theta$, which has an intrinsic conditionalization operator. The
computed evidences, on the other hand, have a possibilistic summation, i.e., the value of
evidence in favor of a composite hypothesis $H = A \vee B$ is the most favorable value of
evidence in favor of each of its terms, i.e.,
$\mathrm{ev}(H) = \max\{\mathrm{ev}(A), \mathrm{ev}(B)\}$. It is impossible, however, to
define a simple scaling operator for this possibilistic support that is compatible with
the FBST's evidence, $\mathrm{ev}$, as it is defined.

Hence, two belief calculi are in simultaneous use in the FBST setup: $\mathrm{ev}(H)$
constitutes a possibilistic partial support structure coexisting in harmony with the
probabilistic support structure given by the posterior probability measure,
$p_n(\theta)$, in the parameter space, see Dubois et al. (1993), Delgado and Moral (1987).

Requirements (V) and (VI), i.e. no ad hoc artifice and possibilistic support, find a rich
interpretation in the juridical or legal context, where they correspond to some of the
most basic juridical principles, see Stern (2003).

Onus Probandi is a basic principle of legal reasoning, also known as Burden of Proof,
see Gaskins (1992) and Kokott (1998). It also manifests itself in accounting through the
Safe Harbor Liability Rule:

"There is no liability as long as there is a reasonable basis for belief, ef- 
fectively placing the burden of proof (Onus Probandi) on the plaintiff, who, 
in a lawsuit, must prove false a defendant's misstatement, without making 
any assumption not explicitly stated by the defendant, or tacitly implied by an 
existing law or regulatory requirement. " 

The Most Favorable Interpretation principle, which, depending on the context, is also
known as Benefit of the Doubt, In Dubio Pro Reo, or Presumption of Innocence, is
a consequence of the Onus Probandi principle, and requires the court to consider the
evidence in the light of what is most favorable to the defendant.

"Moreover, the party against whom the motion is directed is entitled to 
have the trial court construe the evidence in support of its claim as truthful, 
giving it its most favorable interpretation, as well as having the benefit of all 
reasonable inferences drawn from that evidence. " 

A.7 Sensitivity and Inconsistency

For a given prior, likelihood and reference density, let
$\eta = \mathrm{ev}(H; p_0, L_x, r)$ denote the e-value supporting $H$. Let
$\eta', \eta'' \ldots$ denote the e-values with respect to references $r', r'' \ldots$.
The degree of inconsistency of the e-value supporting $H$, induced by a set of
references, $\{r, r', r'' \ldots\}$, is defined by the index

$$ I\{\eta, \eta', \eta'' \ldots\} = \max\{\eta, \eta', \eta'' \ldots\}
   - \min\{\eta, \eta', \eta'' \ldots\} $$

The same index can be used to study the degree of inconsistency of the e-value induced
by a set of priors, $\{p_0, p_0', p_0'' \ldots\}$. One could also study the sensitivity of
the e-value to a set of virtual sample sizes, $\{1n, \gamma' n, \gamma'' n \ldots\}$,
$\gamma \in [0, 1]$, corresponding to scaled likelihoods,
$\{L, L^{\gamma'}, L^{\gamma''} \ldots\}$. This intuitive measure of inconsistency can be
made rigorous in the context of paraconsistent logic and bilattice structures, see Abe et
al. (1998), Alcantara et al. (2002), Arieli and Avron (1996), Costa (1963), Costa and
Subrahmanian (1989) and Costa et al. (1991), (1999).

The bilattice $B(C,D) = \langle C \times D, \leq_k, \leq_t \rangle$, given two complete
lattices, $\langle C, \leq_c \rangle$ and $\langle D, \leq_d \rangle$, has two orders, the
knowledge order, $\leq_k$, and the truth order, $\leq_t$, given by:

$$ (c_1, d_1) \leq_k (c_2, d_2) \Leftrightarrow c_1 \leq_c c_2 \mbox{ and } d_1 \leq_d d_2 $$
$$ (c_1, d_1) \leq_t (c_2, d_2) \Leftrightarrow c_1 \leq_c c_2 \mbox{ and } d_2 \leq_d d_1 $$

The standard interpretation is that $C$ provides the "credibility" or value in favor of a
hypothesis (or statement) $H$, and $D$ provides the "doubt" or value against $H$. If
$(c_1, d_1) \leq_k (c_2, d_2)$, then we have more information (even if inconsistent) about
situation 2 than 1. Analogously, if $(c_1, d_1) \leq_t (c_2, d_2)$, then we have more
reason to trust (or believe) situation 2 than 1 (even if with less information).

For each of the bilattice orders we define a join and a meet operator, based on the join
and the meet operators of the single lattice orders. More precisely, $\sqcup_k$ and
$\sqcap_k$, for the knowledge order, and $\sqcup_t$ and $\sqcap_t$, for the truth order,
are defined by the following equations:

$$ (c_1, d_1) \sqcup_k (c_2, d_2) = (c_1 \sqcup_c c_2, \, d_1 \sqcup_d d_2) , \quad
   (c_1, d_1) \sqcap_k (c_2, d_2) = (c_1 \sqcap_c c_2, \, d_1 \sqcap_d d_2) $$

$$ (c_1, d_1) \sqcup_t (c_2, d_2) = (c_1 \sqcup_c c_2, \, d_1 \sqcap_d d_2) , \quad
   (c_1, d_1) \sqcap_t (c_2, d_2) = (c_1 \sqcap_c c_2, \, d_1 \sqcup_d d_2) $$

The "unit square" bilattice, $\langle [0,1] \times [0,1], \leq, \leq \rangle$, has been
routinely used to represent fuzzy or rough pertinence relations, logical probabilistic
annotations, etc. The lattice $\langle [0,1], \leq \rangle$ is the standard unit interval,
where the join and meet coincide with the max and min operators, $\sqcup = \max$ and
$\sqcap = \min$.

In the unit square bilattice the "truth", "false", "inconsistency" and "indetermination"
extremes are $t$, $f$, $\top$, $\bot$, whose coordinates are given in Figure A.4. As a
simple example, let region $R$ be the convex hull of the four vertices $n$, $s$, $e$ and
$w$, given in Figure A.4. Points $kj$, $km$, $tj$ and $tm$ are the knowledge and truth
join and meet, over $r \in R$.

In the unit square bilattice, the degree of trust and degree of inconsistency for a point
$x = (c, d)$ are given by $BT((c,d)) = c - d$ and $BI((c,d)) = c + d - 1$, a convenient
linear reparameterization of $[0,1]^2$ to $[-1,+1]^2$. Figure A.4 also compares the
credibility-doubt and trust-inconsistency coordinates.

Let $\eta = \mathrm{ev}(H)$, and $\overline{\eta} = \overline{\mathrm{ev}}(H) = 1 - \mathrm{ev}(H)$.
The point $x = (\eta, \overline{\eta})$ in the unit square bilattice represents herein a
single evidence. Since $BI(x) = 0$, such a point is consistent. It is also easy to verify
that, for multiple e-values, the definition of degree of inconsistency given above is the
degree of inconsistency of the knowledge join of all the single evidence points, i.e.,

$$ I(\eta, \eta', \eta'' \ldots) = BI\left( (\eta, \overline{\eta}) \sqcup_k
   (\eta', \overline{\eta}') \sqcup_k (\eta'', \overline{\eta}'') \ldots \right) $$
Negation type operators are not an integral part of the bilattice structure but, in
the unit square, one can define negation as $\neg(c,d) = (d,c)$, and conflation as
$-(c,d) = (1-d, 1-c)$, so that negation reverses trust, but preserves knowledge, and
conflation reverses knowledge, but preserves trust.
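
The unit square operations above can be checked with a short sketch. It is an
illustration only; the function names are hypothetical, and conflation is assumed to swap
and complement the coordinates, $-(c,d) = (1-d, 1-c)$, which is the form that preserves
trust and reverses knowledge.

```python
from functools import reduce


def k_join(p, q):
    """Knowledge join on the unit square: coordinate-wise maximum."""
    return (max(p[0], q[0]), max(p[1], q[1]))


def bt(p):
    """Degree of trust, BT((c, d)) = c - d."""
    return p[0] - p[1]


def bi(p):
    """Degree of inconsistency, BI((c, d)) = c + d - 1."""
    return p[0] + p[1] - 1


def negation(p):
    return (p[1], p[0])


def conflation(p):
    return (1 - p[1], 1 - p[0])


def inconsistency(etas):
    """BI of the knowledge join of the single-evidence points (eta, 1 - eta)."""
    points = [(eta, 1 - eta) for eta in etas]
    return bi(reduce(k_join, points))
```

With e-values `[0.25, 0.5, 0.75]`, the knowledge join of the three evidence points is
`(0.75, 0.75)`, so `inconsistency` returns `0.5`, which is the difference between the
largest and the smallest e-value.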




Figure A.4: Credibility-doubt and trust-inconsistency coordinates



As an example of sensitivity analysis we use the HW model with the standard
uninformative references, the uniform and the maximum entropy densities, represented by
$[1, 1, 1]$ and $[0, 0, 0]$ observation counts. For a motivation for this particular
analysis, see the observations at the end of section E.5. Between these two uninformative
references, we also consider perturbation references corresponding to $[0, 1, 1]$,
$[1, 0, 1]$ and $[1, 1, 0]$ observation counts. Each of these references can be
interpreted as the exclusion of a single observation of the corresponding type from the
observed data set.



Figure A.5: Sensitivity analysis. Panels: Hardy-Weinberg symmetry: Yes;
Hardy-Weinberg symmetry: No.
The e-values in the example are calculated using two sample proportions,
$[x_1, x_2, x_3] = n[1/4, 1/4, 1/2]$ and $[x_1, x_2, x_3] = n[1/4, 1/2, 1/4]$. The first
exhibits the HW hypothesis symmetry, the second does not. The $\log_2$ of the sample
size, $\log_2(n)$, ranged from 3 to 7. In Figure A.5, the e-values corresponding to each
choice of reference are given by an interpolated dashed line. The interpretation of the
vertical interval (solid bars) between the dashed lines is similar to that of the usual
statistical error bars. However, the uncertainty represented by these bars does not have
a probabilistic nature, being rather a possibilistic measure of inconsistency, defined in
the partial support structure given by the FBST evidence value, see Stern (2004).



A.8 Complex Models and Compositionality

The relationship between the credibility of a complex hypothesis, $H$, and those of its
constituent elementary hypotheses, $H^{(i,j)}$, in the independent setup, can be analyzed
under the FBST, see Borges and Stern (2006) for precise definitions and detailed
interpretation.

Let us consider elementary hypotheses, $H^{(i,j)}$, in $k$ independent constituent
models, $M^j$, and the complex or composite hypothesis $H$, equivalent to a
(homogeneous) logical composition (disjunction of conjunctions) of elementary hypotheses,
in the composite product model, $M$.

The possibilistic nature of the e-value measure makes it easy to compute the support
for disjunctive complex hypotheses. Conjunctions of elementary hypotheses require a more
sophisticated analysis. First we must observe that knowing the e-values of the elementary
hypotheses is not enough to know the e-value of the conjunction; elementary e-values
can give only lower and upper bounds to the support for the conjunction. Figure A.6
illustrates these bounds, and also the following results; for further details see Borges
and Stern (2006). For conjunctive compositions, the models' truth functions, $W^j$, are
the key element for the required algebraic manipulation, as stated in the next result.

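If $H$ is expressed in HDNF or Homogeneous Disjunctive Normal Form,

$$ H = \bigvee_{i=1}^{q} \bigwedge_{j=1}^{k} H^{(i,j)} , $$

then the e-value supporting $H$ is

$$ \mathrm{ev}(H) = \mathrm{ev}\left( \bigvee_{i=1}^{q} \bigwedge_{j=1}^{k} H^{(i,j)} \right)
   = W\left( \max_{i=1}^{q} \prod_{j=1}^{k} s^{*(i,j)} \right)
   = W\left( \max_{i=1}^{q} s^{*i} \right) = \max_{i=1}^{q} W\left( s^{*i} \right) $$

$$ = \max_{i=1}^{q} \mathrm{ev}\left( \bigwedge_{j=1}^{k} H^{(i,j)} \right)
   = \max_{i=1}^{q} \mathrm{ev}\left( H^i \right) ; $$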
where the cumulative surprise distribution of the composite model, $W(v)$, is given by the
Mellin convolution operation, see Springer (1979), defined as

$$ W = \bigotimes_{1 \leq j \leq k} W^j , \quad
   \left[ W^1 \otimes W^2 \right](v) = \int_0^\infty W^1(v/y) \, W^2(dy) . $$

The probability distribution of the product of two independent positive random
variables is the Mellin convolution of each of their distributions. From this
interpretation, we immediately see that $\otimes$ is a commutative and associative
operator.
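
The product interpretation can be illustrated with discrete distributions, for which the
Mellin convolution reduces to a finite sum. The sketch below is an illustration only, with
hypothetical function names; distributions are represented as dictionaries mapping
positive values to probabilities, and exact rational arithmetic is used so that
commutativity and associativity can be checked without rounding.

```python
from fractions import Fraction


def cdf(pmf, v):
    """Cumulative distribution Pr(S <= v) of a discrete positive variable."""
    return sum(p for s, p in pmf.items() if s <= v)


def mellin(pmf1, pmf2):
    """Distribution of the product of two independent positive variables."""
    out = {}
    for s1, p1 in pmf1.items():
        for s2, p2 in pmf2.items():
            out[s1 * s2] = out.get(s1 * s2, 0) + p1 * p2
    return out


def mellin_cdf(pmf1, pmf2, v):
    """[W1 (x) W2](v) = sum over y of W1(v / y) * Pr(Y = y)."""
    return sum(cdf(pmf1, v / y) * p for y, p in pmf2.items())


w1 = {Fraction(1, 2): Fraction(1, 3), Fraction(2): Fraction(2, 3)}
w2 = {Fraction(1): Fraction(1, 4), Fraction(3): Fraction(3, 4)}
w3 = {Fraction(1, 5): Fraction(1, 2), Fraction(4): Fraction(1, 2)}
```

Here `mellin_cdf(w1, w2, v)` evaluates the convolution integral directly, while
`cdf(mellin(w1, w2), v)` first builds the distribution of the product; both routes give
the same value.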

Mirroring Wittgenstein, in the FBST context, we can call the e-value, $\mathrm{ev}(H)$,
the cumulative surprise distribution, $W(v)$, and the Mellin convolution operation,
$\otimes$, respectively, truth value, truth function, and truth operation.

Finally, we observe that, in the extreme case of null-or-full support, that is, when, for
$1 \leq i \leq q$ and $1 \leq j \leq k$, $s^{*(i,j)} = 0$ or $s^{*(i,j)} = \hat{s}^j$, the
evidence values (or, in this context, truth values) of the constituent elementary
hypotheses are either 0 or 1, and the conjunction and disjunction composition rules of
classical logic hold.

Numerical Aspects

In appendix G we detail an efficient Monte Carlo algorithm for computing
$\mathrm{ev}(H; p_n, r)$. In this algorithm, the bulk of the work consists in generating
random points in the parameter space, $\theta^j \in \Theta$, and evaluating the surprise
function, $s^j = s(\theta^j)$. The Monte Carlo algorithm proceeds updating several
accumulators based on the tangential set "hit indicator",

$$ \mathbf{1}^*(\theta^j; p_n, r) = \mathbf{1}(\theta^j \in T)
   = \mathbf{1}(s(\theta^j) > s^*) . $$

In order to compute a $k$-step function approximation of $W(v)$, we only have to split
the surprise range interval, $[0, \hat{s}]$, with a vector of $k$ intermediate points,
$0 < s^1 < s^2 < \ldots s^i < s^* < s^{i+1} < \ldots s^k < \hat{s}$, and set up a set of
vector accumulators based on the vector threshold indicator,
$\mathbf{1}^i(\theta^j; p_n, r) = \mathbf{1}(s(\theta^j) > s^i)$. Updating the vector
accumulators usually imposes only a small overhead on the Monte Carlo algorithm.

Numerical convolutions of step functions can be easily computed with the help of
good condensation procedures, see Kaplan and Lin (1987). For alternative approaches
to numerical convolution see Springer (1979) and Williamson (1989). In the case of
dependent models, the composite truth function can be solved with the help of analytical
and numerical copulas, see Cherubini (2004), Mari and Kotz (2001) and Nelsen (2006).
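
The accumulator scheme described in this section can be sketched on a toy model. This is
a minimal illustration under explicit assumptions, not the algorithm of appendix G: the
posterior is taken to be a Beta(8, 4) density on $]0,1[$, the reference is uniform, the
surprise is evaluated up to the normalization constant of the posterior, and the sharp
hypothesis is $\theta = 0.5$. All names are hypothetical.

```python
import random


def surprise(theta, a, b):
    """Unnormalized surprise: Beta(a, b) posterior kernel over a uniform reference."""
    return theta ** (a - 1) * (1 - theta) ** (b - 1)


def step_w(thresholds, a=8.0, b=4.0, n_draws=20000, seed=0):
    """Monte Carlo estimate of W(s^i) = Pr(s(theta) <= s^i) at each threshold."""
    rng = random.Random(seed)
    hits = [0] * len(thresholds)                  # vector accumulators
    for _ in range(n_draws):
        s = surprise(rng.betavariate(a, b), a, b)
        for i, s_i in enumerate(thresholds):
            hits[i] += s > s_i                    # threshold indicator 1(s > s^i)
    return [1.0 - h / n_draws for h in hits]


a, b = 8.0, 4.0
s_hat = surprise((a - 1) / (a + b - 2), a, b)     # surprise at the posterior mode
s_star = surprise(0.5, a, b)                      # supremum of s over H = {0.5}
grid = sorted([s_hat * i / 10 for i in range(1, 10)] + [s_star])
w = step_w(grid, a, b)                            # step approximation of W(v)
ev = w[grid.index(s_star)]                        # ev(H) = W(s*)
```

Since the same draws feed every accumulator, the estimated step function is
non-decreasing by construction, and the e-value is read off at the threshold $s^*$.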
Figure A.6: Subplots 1,2: $W^j$, $s^{*j}$, and $\mathrm{ev}(H^j)$, for $j = 1,2$;
Subplot 3: $W^1 \otimes W^2$, $s^{*1} s^{*2}$, $\mathrm{ev}(H^1 \wedge H^2)$ and bounds;
Subplot 4: Structure $M^3$ is an independent replica of $M^2$,
$\mathrm{ev}(H^1) < \mathrm{ev}(H^2)$, but $\mathrm{ev}(H^1 \wedge H^3) > \mathrm{ev}(H^2 \wedge H^3)$.



Appendix B 



Binomial, Dirichlet, Poisson and 
Related Distributions 

This essay has been published as Pereira and Stern (2008).

The matrix notation used in this section is defined in section F.1.



B.1 Introduction and Notation

This essay presents important properties of the distributions used for categorical data
analysis. Regardless of the population size being known or unknown, or the specific
observational stopping rule, the Bernoulli Process generates the sampling distributions
considered. On the other hand, the Gamma distribution generates the prior and posterior
distributions obtained: Gamma, Gamma-Poisson, Dirichlet, and Dirichlet-Multinomial.
The Poisson Process as generator of sampling distributions is also considered.

The generation form of the discrete sampling distributions presented in Section 2
is, in fact, a characterization method of such distributions. If one recalls that all the
distribution classes being mixed are complete classes and are Blackwell sufficient for the
Bernoulli processes, the mixing distributions are unique. This characterization method is
completely described in Basu and Pereira (1983).

Section 9 describes the Rényi-Aczél characterization of the Poisson distribution.
Although it could be thought of as a de Finetti type characterization, this
characterization is based on alternative requirements. While the de Finetti
characterization is based on a permutable infinite 0-1 process, the Rényi-Aczél
characterization is based on a homogeneous Markov process in a finite interval,
generating finite discrete Markov Chains. Using the Rényi-Aczél characterization,
together with Theorem 4, one can obtain a characterization of Multinomial distributions.
Section 7 describes the Dirichlet of Second Kind. In this section we also show how to
use a multivariate normal approximation to the logarithm of a random vector distributed 
as Dirichlet of Second Kind, and a log-normal approximation to a Gamma distribution, 
see Aitchison and Shen (1980). In many examples of the authors' consulting practice these 
approximations proved to be a powerful modeling tool, leading to efficient computational 
procedures. 

The development of the theory in this essay is self contained, seeking a unified treat- 
ment of a large variety of problems, including finite and infinite populations, contingency 
tables of arbitrary dimension, deficiently categorized data, logistic regressions, etc. These 
models also present a way of introducing non parametric solutions. 

The singular representation adopted is unusual in statistical texts. This singular rep- 
resentation makes it simpler to extend and generalize the results and greatly facilitates 
numerical and computational implementation. In this essay, corollaries, lemmas, propo- 
sitions and theorems are numbered sequentially. 

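We introduce the following notation for observation matrices, and respective summation
vectors:

$$ U = [u^1, u^2, \ldots] , \quad U^{1:n} = [u^1, u^2, \ldots u^n] , \quad
   x^n = U^{1:n} \mathbf{1} . $$

The tilde accent indicates some form of normalization like, for example,
$\tilde{x} = (1/\mathbf{1}'x) x$.

Lemma 1: If $u^1, \ldots u^n$ are i.i.d. random vectors,

$$ x = U^{1:n} \mathbf{1} \Rightarrow E(x) = n E(u^1) \mbox{ and } \mathrm{Cov}(x) = n \mathrm{Cov}(u^1) . $$

The first result is trivial. For the second result, we only have to remember the
transformation properties of the expectation and covariance operators under a linear
operation on their argument,

$$ E(AY + b) = A E(Y) + b , \quad \mathrm{Cov}(AY + b) = A \mathrm{Cov}(Y) A' , $$

and write

$$ \mathrm{Cov}(x) = \mathrm{Cov}(U^{1:n} \mathbf{1})
   = \mathrm{Cov}\left( (\mathbf{1}' \otimes I) \mathrm{Vec}(U^{1:n}) \right)
   = (\mathbf{1}' \otimes I) \left( I \otimes \mathrm{Cov}(u^1) \right) (\mathbf{1} \otimes I) $$
$$ = \left( \mathbf{1}' \otimes \mathrm{Cov}(u^1) \right) (\mathbf{1} \otimes I)
   = n \mathrm{Cov}(u^1) . $$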

B.2 The Bernoulli Process

Let us consider a sequence of random vectors $u^1, u^2, \ldots$ where each $u^i$ can
assume only two values

$$ I^1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \quad \mbox{or} \quad
   I^2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \quad \mbox{where} \quad
   I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} , $$

representing success or failure. That is, $u^i$ can assume the value of any column of the
identity matrix, $I$. We say that $u^i$ is of class $k$, $c(u^i) = k$, iff $u^i = I^k$,
$k \in [1, 2]$.

Also assume that (in your opinion) this sequence is exchangeable, that is, if
$p = [p(1), p(2), \ldots p(n)]$ is a permutation of $[1, 2, \ldots n]$, then,
$\forall n, p$,

$$ \Pr(u^1, \ldots u^n) = \Pr(u^{p(1)}, \ldots u^{p(n)}) . $$

Just from this exchangeability constraint, that can be interpreted as saying that the
index labels are non informative, de Finetti Theorem establishes the existence of an
unknown vector

$$ \theta \in \Theta = \left\{ 0 \leq \theta = [\theta_1, \theta_2]' \leq \mathbf{1}
   \mid \mathbf{1}'\theta = 1 \right\} $$

such that, conditionally on $\theta$, $u^1, u^2, \ldots$ are mutually independent, and the
conditional probability $\Pr(u^i = I^k \mid \theta)$ is $\theta_k$, i.e.

$$ (u^1 \amalg u^2 \amalg \ldots) \mid \theta \quad \mbox{or} \quad
   \coprod_{i=1}^{\infty} u^i \mid \theta , \quad \mbox{and} \quad
   \Pr(u^i = I^k \mid \theta) = \theta_k . $$

Vector $\theta$ is characterized as the limit of proportions

$$ \theta = \lim_{n \to \infty} \frac{1}{n} x^n , \quad
   x^n = U^{1:n} \mathbf{1} = \sum_{j=1}^{n} u^j . $$

Conditionally on $\theta$, the sequence $u^1, u^2, \ldots$ receives the name of Bernoulli
process. As we shall see, many well known discrete distributions can be obtained from
transformations of this process.

The expectation and covariance (conditionally on $\theta$) of any vector in the sequence
are:

• $E(u^i) = \theta$ ,

• $\mathrm{Cov}(u^i) = E(u^i (u^i)') - E(u^i) E((u^i)') = \mathrm{diag}(\theta) - \theta \otimes \theta'$ .



When the summation domain, $1:n$, is understood, we may use the relaxed notation $x$
instead of $x^n$. We also define the Delta operator, or "pointwise power product" between
two vectors of same dimension: Given $\theta$ and $x$, $n \times 1$,

$$ \theta \triangle x = \prod_{i=1}^{n} (\theta_i)^{x_i} . $$

A stopping rule, $\delta$, establishes, for every $n = 1, 2, \ldots$, a decision of
observing (or not) $u^{n+1}$, after the observations $u^1, \ldots u^n$.

For a good understanding of this text, it is necessary to have a clear interpretation of
conditional expressions like $x^n \mid n$ or $x_2^n \mid x_1^n$. In both cases we are
referring to an unknown vector, $x^n$, but with a different partial information. In the
first case, we know $n$, and therefore we know the sum of components,
$x_1^n + x_2^n = n$; however, we know neither component $x_1^n$ nor $x_2^n$. In the second
case we only know the first component of $x^n$, $x_1^n$, and do not know the second
component, $x_2^n$; obviously we also do not know the sum, $n = x_1^n + x_2^n$. Just pay
attention: We list what we know to the right of the bar and (unless we have some
additional information) everything that cannot be deduced from this list is unknown.

The first distribution we are going to discuss is the Binomial. Let $\delta(n)$ be the
stopping rule where $n$ is the pre-established number of observations. The (conditional)
probability of the observation sequence $U^{1:n}$ is

$$ \Pr(U^{1:n} \mid \theta) = \theta \triangle x^n . $$

The summation vector, $x^n$, has Binomial distribution with parameters $n$ and $\theta$,
and we write $x^n \mid [n, \theta] \sim \mathrm{Bi}(n, \theta)$. When $n$ (or $\delta(n)$)
is implicit in the context we may write $x \mid \theta$ instead of
$x^n \mid [n, \theta]$. The Binomial distribution has the following expression:

$$ \Pr(x^n \mid n, \theta) = \binom{n}{x^n} (\theta \triangle x^n) $$

where

$$ \binom{n}{x} \equiv \frac{\Gamma(n+1)}{\Gamma(x_1+1) \, \Gamma(x_2+1)}
   = \frac{n!}{x_1! \, x_2!} \quad \mbox{and} \quad n = \mathbf{1}'x . $$

A good exercise for the reader is to check that the expectation vector and the covariance
matrix of $x^n \mid [n, \theta]$ have the following expressions:

$$ E(x^n) = n\theta \quad \mbox{and} \quad
   \mathrm{Cov}(x^n) = n (\theta \triangle \mathbf{1})
   \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} . $$
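
The exercise above can be checked by direct enumeration. The sketch below is only an
illustration, with hypothetical function names: it codes the Delta operator and the
coefficient $\binom{n}{x}$ in their vector form, and computes the exact mean vector and
covariance matrix with rational arithmetic.

```python
import math
from fractions import Fraction


def delta(theta, x):
    """Pointwise power product: prod_i theta_i ** x_i."""
    return math.prod(t ** k for t, k in zip(theta, x))


def coef(x):
    """n! / (x_1! ... x_m!), with n = 1'x."""
    return math.factorial(sum(x)) // math.prod(math.factorial(k) for k in x)


def binomial_pmf(x, theta):
    """Pr(x^n | n, theta) in vector form, with x = (x_1, x_2)."""
    return coef(x) * delta(theta, x)


def moments(n, theta):
    """Exact mean vector and covariance matrix of x^n | [n, theta]."""
    support = [(k, n - k) for k in range(n + 1)]
    mean = [sum(binomial_pmf(x, theta) * x[i] for x in support) for i in range(2)]
    cov = [[sum(binomial_pmf(x, theta) * (x[i] - mean[i]) * (x[j] - mean[j])
                for x in support) for j in range(2)] for i in range(2)]
    return mean, cov


theta = (Fraction(1, 3), Fraction(2, 3))
mean, cov = moments(6, theta)
```

For $n = 6$ and $\theta = [1/3, 2/3]'$ the enumeration gives $E(x^n) = [2, 4]'$ and the
covariance factor $n(\theta \triangle \mathbf{1}) = 6 \cdot 2/9 = 4/3$.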



The second distribution we discuss is the Negative Binomial. Let $\delta(x_1^n)$ be the
rule establishing to stop at observation $u^n$ when obtaining a pre-established number of
$x_1^n$ successes. The random variable $x_2^n$, the number of failures we have when we
obtain the required $x_1^n$ successes, is called a Negative Binomial with parameters
$x_1^n$ and $\theta$. It is not hard to prove that the Negative Binomial distribution,
$x_2^n \mid [x_1^n, \theta] \sim \mathrm{NB}(x_1^n, \theta)$, has expression,
$\forall x_2^n \in \mathbb{N}$,

$$ \Pr(x^n \mid x_1^n, \theta) = \frac{x_1^n}{n} \binom{n}{x^n} (\theta \triangle x^n)
   = \theta_1 \Pr\left( (x^n - I^1) \mid (n-1), \theta \right) . $$

Note that, from the definition of this distribution, $x_1^n$ is a positive integer
number. Nevertheless, we can extend the definition above for any real positive value $a$,
and still obtain a probability function. For this, we use

$$ \sum_{j=0}^{\infty} \frac{\Gamma(a+j)}{\Gamma(a) \, j!} (1-\pi)^j = \pi^{-a} , \quad
   \forall a \in \, ]0, \infty[ \mbox{ and } \pi \in \, ]0, 1[ . $$

The reader is asked to check the last equation, as well as the following expressions for
the expectation and variance of $x_2^n$:

$$ E(x_2^n \mid x_1^n, \theta) = \frac{x_1^n \theta_2}{\theta_1} \quad \mbox{and} \quad
   \mathrm{Var}(x_2^n \mid x_1^n, \theta) = \frac{x_1^n \theta_2}{(\theta_1)^2} . $$

In the special case of $\delta(x_1^n = 1)$, the Negative Binomial distribution is also
known as the Geometric distribution with parameter $\theta$. If $a$ random variables are
independent and identically distributed (i.i.d.) as a geometric distribution with
parameter $\theta$, then the sum of these variables has Negative Binomial distribution
with parameters $a$ and $\theta$.
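
The two forms of the Negative Binomial probability function, and the expression for its
expectation, can be verified numerically. The sketch below is an illustration with
hypothetical function names; the identity between the two forms is checked with exact
rational arithmetic, and the expectation is approximated by a truncated sum in floating
point.

```python
import math
from fractions import Fraction


def delta(theta, x):
    """Pointwise power product: prod_i theta_i ** x_i."""
    return math.prod(t ** k for t, k in zip(theta, x))


def coef(x):
    """n! / (x_1! x_2!), with n = x_1 + x_2."""
    return math.factorial(sum(x)) // math.prod(math.factorial(k) for k in x)


def nb_pmf(x2, x1, theta):
    """(x_1 / n) * coef(x) * (theta delta x): x2 failures at the x1-th success."""
    x = (x1, x2)
    return Fraction(x1, x1 + x2) * coef(x) * delta(theta, x)


def shifted_binomial(x2, x1, theta):
    """theta_1 * Pr(x - I^1 | n - 1, theta): the last observation is a success."""
    x = (x1 - 1, x2)
    return theta[0] * coef(x) * delta(theta, x)


exact_theta = (Fraction(2, 5), Fraction(3, 5))
float_theta = (0.4, 0.6)
# Truncated expectation of the number of failures for x_1 = 3 successes.
mean = sum(k * nb_pmf(k, 3, float_theta) for k in range(400))
```

With $x_1^n = 3$ and $\theta = [0.4, 0.6]'$, the truncated sum agrees with
$x_1^n \theta_2 / \theta_1 = 4.5$ up to floating point error.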

The third distribution studied in this essay is the Hypergeometric. Going back to
the original sequence, $u^1, u^2, \ldots$, assume that a first observer knows the first
$N$ observations, while a second observer knows only a subsequence of $n < N$ of these
observations. Since the original sequence, $u^1, u^2, \ldots$, is exchangeable, we can
assume, without loss of generality, that the subsequence known to the second observer is
the subsequence of the first $n$ observations, $u^1, \ldots u^n$. Using de Finetti
theorem, we have that $x^n$ and $x^N - x^n = U^{n+1:N} \mathbf{1}$ are conditionally
independent, given $\theta$. That is, $x^n \amalg (x^N - x^n) \mid \theta$.
Moreover, we can write

$$ x^n \mid [n, \theta] \sim \mathrm{Bi}(n, \theta) , \quad
   x^N \mid [N, \theta] \sim \mathrm{Bi}(N, \theta) , \quad \mbox{and} $$

$$ (x^N - x^n) \mid [(N-n), \theta] \sim \mathrm{Bi}(N-n, \theta) . $$

Our goal is to find the distribution function of $x^n \mid x^N$. Note that $x^N$ is
sufficient for $U^{1:N}$ given $\theta$, and $x^n$ is sufficient for $U^{1:n}$. Moreover,
$x^n \mid [n, x^N]$ has the same distribution of $x^n \mid [n, x^N, \theta]$. Using the
basic rules of probability calculus and the properties above, we have that

$$ \Pr(x^n \mid n, x^N, \theta)
   = \frac{\Pr(x^n, x^N \mid n, N, \theta)}{\Pr(x^N \mid n, N, \theta)}
   = \frac{\Pr(x^n, (x^N - x^n) \mid n, N, \theta)}{\Pr(x^N \mid n, N, \theta)} $$

$$ = \frac{\Pr(x^n \mid n, N, \theta) \, \Pr(x^N - x^n \mid n, N, \theta)}
          {\Pr(x^N \mid n, N, \theta)} . $$

Hence, $x^n \mid [n, x^N]$ has distribution function

$$ \Pr(x^n \mid n, x^N) = \frac{ \binom{n}{x^n} \binom{N-n}{x^N - x^n} }{ \binom{N}{x^N} } , $$

where $0 \leq x^n \leq x^N \leq N\mathbf{1}$ , $\mathbf{1}'x^n = n$ ,
$\mathbf{1}'x^N = N$ .

This is the vector representation of the Hypergeometric probability distribution.

$$ x^n \mid [n, x^N] \sim \mathrm{Hy}(n, N, x^N) . $$



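The reader is asked to check the following expressions for the expectation and
(conditional) covariance of $x^n \mid [n, N, x^N]$, and covariance of $u^i$ and $u^j$,
$i \neq j$, $i, j \leq n$:

$$ E(x^n) = \frac{n}{N} x^N \quad \mbox{and} \quad
   \mathrm{Cov}(x^n) = \frac{n(N-n)}{(N-1)N^2} (x^N \triangle \mathbf{1})
   \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} , $$

$$ \mathrm{Cov}(u^i, u^j \mid x^N) = \frac{1}{(N-1)N^2} (x^N \triangle \mathbf{1})
   \begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix} . $$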
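
The Hypergeometric expressions can be checked by enumeration in exact arithmetic. The
sketch below is an illustration only, with hypothetical function names; the variance
factor $n(N-n)/((N-1)N^2)$ used in the check is the standard Hypergeometric variance
written in the vector notation of this section.

```python
import math
from fractions import Fraction


def coef(x):
    """n! / (x_1! x_2!), with n = x_1 + x_2."""
    return math.factorial(sum(x)) // math.prod(math.factorial(k) for k in x)


def hyper_pmf(xn, xN):
    """Pr(x^n | n, x^N) = coef(x^n) * coef(x^N - x^n) / coef(x^N)."""
    rest = [big - small for small, big in zip(xn, xN)]
    if min(rest) < 0 or min(xn) < 0:
        return Fraction(0)
    return Fraction(coef(xn) * coef(rest), coef(xN))


xN, n = (5, 7), 4
N = sum(xN)
support = [(k, n - k) for k in range(n + 1)]
total = sum(hyper_pmf(x, xN) for x in support)
mean_1 = sum(hyper_pmf(x, xN) * x[0] for x in support)
var_1 = sum(hyper_pmf(x, xN) * (x[0] - mean_1) ** 2 for x in support)
```

Here the first observer saw $x^N = [5, 7]'$ and the second observer saw $n = 4$ of these
$N = 12$ observations, so that $E(x_1^n) = 4 \cdot 5/12$.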



We finish this section presenting the derivation of the Beta-Binomial distribution. Let
us assume that the first observer observed $x_2^n$ failures, until observing a
pre-established number of $x_1^n$ successes. A second observer makes more observations,
observing $x_2^N$ failures until completing the pre-established number of $x_1^N$
successes, $x_1^n < x_1^N$.

Since $x_1^n$ and $x_1^N$ are pre-established, we can write

$$ x_2^N \mid \theta \sim \mathrm{NB}(x_1^N, \theta) , \quad
   x_2^n \mid \theta \sim \mathrm{NB}(x_1^n, \theta) , $$

$$ (x_2^N - x_2^n) \mid \theta \sim \mathrm{NB}(x_1^N - x_1^n, \theta) \quad \mbox{and} \quad
   x_2^n \amalg (x_2^N - x_2^n) \mid \theta . $$

As before, our goal is to describe the distribution of $x_2^n \mid [x_1^n, x^N]$. If one
notices that $[x^n, x^N]$ is sufficient for $[x^n, (x^N - x^n)]$, with respect to
$\theta$, the problem becomes similar to the Hypergeometric case, and one can obtain

$$ \Pr(x_2^n \mid x_1^n, x^N) =
   \frac{x_2^N! \, \Gamma(x_1^N)}{\Gamma(x_2^N + x_1^N)} \;
   \frac{\Gamma(x_2^n + x_1^n)}{x_2^n! \, \Gamma(x_1^n)} \;
   \frac{\Gamma(x_2^N - x_2^n + x_1^N - x_1^n)}{(x_2^N - x_2^n)! \, \Gamma(x_1^N - x_1^n)} , $$

$$ x_2^n \in \{0, 1, \ldots, x_2^N\} . $$

This is the distribution function of a random variable called Beta Binomial with
parameters $x_1^n$ and $x^N$.

$$ x_2^n \mid (x_1^n, x^N) \sim \mathrm{BB}(x_1^n, x^N) . $$

The properties of this distribution will be studied in the general case of the
Dirichlet-Multinomial, in the following sections.
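
As a quick check of the Beta-Binomial expression, the sketch below evaluates it for
integer arguments, where $\Gamma(k) = (k-1)!$, and verifies that the probabilities add up
to one over $x_2^n \in \{0, \ldots, x_2^N\}$. It is an illustration only; the function
names are hypothetical.

```python
import math
from fractions import Fraction


def gamma_int(k):
    """Gamma function at a positive integer: Gamma(k) = (k - 1)!."""
    return math.factorial(k - 1)


def bb_pmf(x2n, x1n, xN):
    """Pr(x_2^n | x_1^n, x^N) for integers with 0 < x_1^n < x_1^N."""
    x1N, x2N = xN
    num = (math.factorial(x2N) * gamma_int(x1N)
           * gamma_int(x2n + x1n)
           * gamma_int(x2N - x2n + x1N - x1n))
    den = (gamma_int(x2N + x1N)
           * math.factorial(x2n) * gamma_int(x1n)
           * math.factorial(x2N - x2n) * gamma_int(x1N - x1n))
    return Fraction(num, den)


x1n, xN = 2, (5, 6)
probs = [bb_pmf(k, x1n, xN) for k in range(xN[1] + 1)]
```

In this example the first observer stops at $x_1^n = 2$ successes, and the second
observer completes $x_1^N = 5$ successes after $x_2^N = 6$ failures.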

Generalized categories for $k > 2$ can be represented by the orthonormal base
$I^1, I^2, \ldots I^k$, i.e., the columns of the $k$-dimensional identity matrix. The
Multinomial and Hypergeometric multivariate distributions, presented in the next
sections, are distributions derived from this basic generalization.
B.3 Multinomial Distribution

Let $u^i$, $i = 1, 2, \ldots$ be random vectors with possible results in the set of columns of the $m$-dimensional identity matrix, $I^k$, $k \in 1{:}m$. We say that $u^i$ is of class $k$, $c(u^i) = k$, iff $u^i = I^k$.

Let $\theta \in [0, 1]^m$ be the vector of probabilities for an observation of class $k$ in an $m$-variate Bernoulli process, i.e.,

$$ \Pr(u^i = I^k \,|\, \theta) = \theta_k , \quad 0 \le \theta \le 1 , \quad 1'\theta = 1 . $$

As in the last section, let

$$ U = [u^1, u^2, \ldots] \quad \text{and} \quad x^n = U^{1:n} 1 . $$

Definition: If the knowledge of $\theta$ makes the vectors $u^i$ independent, then the (conditional) distribution of $x^n$ given $\theta$ is the Multinomial distribution of order $m$ with parameters $n$ and $\theta$, given by

$$ \Pr(x^n \,|\, n, \theta) = \binom{n}{x^n} (\theta \triangle x^n) $$

where

$$ \binom{n}{x} = \frac{\Gamma(n+1)}{\Gamma(x_1+1) \cdots \Gamma(x_m+1)} = \frac{n!}{x_1! \cdots x_m!} \quad \text{and} \quad n = 1'x . $$

We represent the $m$-Multinomial distribution writing

$$ x^n \,|\, [n, \theta] \sim \mathrm{Mn}_m(n, \theta) . $$

When m = 2, we have the binomial case. 
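The definition above can be checked numerically. The following is a minimal sketch, assuming that $\theta \triangle x$ denotes the product of the $\theta_k^{x_k}$; the function name `multinomial_pmf` and the example values are hypothetical, chosen only for illustration.

```python
from itertools import product
from math import exp, lgamma, log

def multinomial_pmf(x, theta):
    """Pr(x | n, theta) for a Multinomial of order m = len(x), with n = sum(x)."""
    n = sum(x)
    # log of the multinomial coefficient n! / (x_1! ... x_m!)
    log_coef = lgamma(n + 1) - sum(lgamma(xk + 1) for xk in x)
    log_prod = 0.0
    for xk, tk in zip(x, theta):
        if xk > 0:
            if tk == 0.0:
                return 0.0  # a positive count in a zero-probability class
            log_prod += xk * log(tk)
    return exp(log_coef + log_prod)

# Hypothetical example: m = 3 classes, n = 4 observations.
theta = [0.5, 0.3, 0.2]
n = 4
support = [x for x in product(range(n + 1), repeat=3) if sum(x) == n]
total = sum(multinomial_pmf(x, theta) for x in support)

# With m = 2 the same function gives a binomial probability.
binomial_case = multinomial_pmf([1, 3], [0.4, 0.6])
```

Summing over all count vectors with $1'x = n$ should give one, and the order 2 case should agree with the binomial probability $\binom{4}{1} 0.4 \cdot 0.6^3$.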

Let us now examine some properties of the Multinomial distribution. 

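Lemma 2: If $x \,|\, \theta \sim \mathrm{Mn}_m(n, \theta)$ then the (conditional) expectation and covariance of $x$ are

$$ E(x) = n\theta \quad \text{and} \quad \mathrm{Cov}(x) = n\,(\mathrm{diag}(\theta) - \theta \otimes \theta') . $$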

Proof: Analogous to the binomial case. 

The next result presents a characterization of the Multinomial in terms of the Poisson 
distribution. 

Lemma 3: Reproductive property of the Poisson distribution.

$$ x_i \sim \mathrm{Ps}(\lambda_i) \;\Rightarrow\; 1'x \,|\, \lambda \sim \mathrm{Ps}(1'\lambda) , $$

that is, the sum of (independent) Poisson variates is also Poisson.
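Theorem 4: Characterization of the Multinomial by the Poisson.

Let $x = [x_1, \ldots, x_m]'$ be a random vector with independent Poisson distributed components with parameters in the known vector $\lambda = [\lambda_1, \ldots, \lambda_m]' > 0$. Let $n$ be a positive integer. Then, given $\lambda$,

$$ x \,|\, [n = 1'x, \lambda] \sim \mathrm{Mn}_m(n, \theta) \quad \text{where} \quad \theta = \frac{1}{1'\lambda}\,\lambda . $$

Proof: The joint distribution of $x$, given $\lambda$, is

$$ \Pr(x \,|\, \lambda) = \prod_{k=1}^{m} \frac{e^{-\lambda_k} \lambda_k^{x_k}}{x_k!} . $$

Using the Poisson reproductive property,

$$ \Pr(x \,|\, 1'x = n, \lambda) = \frac{\Pr(1'x = n \wedge x \,|\, \lambda)}{\Pr(1'x = n \,|\, \lambda)} = \delta(n = 1'x)\, \frac{\Pr(x \,|\, \lambda)}{\Pr(1'x = n \,|\, \lambda)} . $$

Substituting the Poisson probabilities, the exponential factors cancel and the ratio reduces to $\binom{n}{x} (\theta \triangle x)$, the Multinomial probability.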
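The characterization in Theorem 4 can be verified numerically for a single point. This is a minimal sketch under stated assumptions: the rates `lam` and the counts `x` are hypothetical example values, and the helper names are illustrative only.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Pr(k | lam) for a Poisson variate."""
    return exp(-lam) * lam ** k / factorial(k)

def multinomial_pmf(x, theta):
    """Multinomial probability of the count vector x."""
    coef = factorial(sum(x))
    prob = 1.0
    for xk, tk in zip(x, theta):
        coef //= factorial(xk)   # exact integer division at every step
        prob *= tk ** xk
    return coef * prob

lam = [1.5, 0.7, 2.3]   # hypothetical Poisson rates
x = [2, 1, 3]           # hypothetical counts, n = 6
n = sum(x)

# Joint probability of independent Poisson counts, divided by the
# probability of their sum (Poisson with rate 1'lam, by Lemma 3).
joint = 1.0
for xk, lk in zip(x, lam):
    joint *= poisson_pmf(xk, lk)
conditional = joint / poisson_pmf(n, sum(lam))

# Multinomial probability with theta = lam / 1'lam.
theta = [lk / sum(lam) for lk in lam]
direct = multinomial_pmf(x, theta)
```

The two numbers `conditional` and `direct` should agree up to floating point rounding.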

The following results state important properties of the Multinomial distribution. The 
proof of these properties is simple, using the characterization of the Multinomial by the 
Poisson, and the Poisson reproductive property. 



Theorem 5: Multinomial Class Partition

Let $1{:}m$ be the index domain for the classes of an order $m$ Multinomial distribution. Let $T$ be a partition matrix breaking the $m$ classes into $s$ super-classes. Let $x \sim \mathrm{Mn}_m(n, \theta)$, then $y = Tx \sim \mathrm{Mn}_s(n, T\theta)$.
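Theorem 5 can be illustrated by brute-force enumeration. The sketch below assumes that a partition matrix $T$ is an $s \times m$ 0-1 matrix with exactly one 1 in each column; the specific `T`, `theta` and `n` are hypothetical example values.

```python
from itertools import product
from math import factorial

def multinomial_pmf(x, theta):
    coef = factorial(sum(x))
    prob = 1.0
    for xk, tk in zip(x, theta):
        coef //= factorial(xk)
        prob *= tk ** xk
    return coef * prob

def apply(T, v):
    """Matrix-vector product T v, with T given as a list of rows."""
    return [sum(tij * vj for tij, vj in zip(row, v)) for row in T]

# Hypothetical example: m = 4 classes merged into s = 2 super-classes.
T = [[1, 1, 0, 0],
     [0, 0, 1, 1]]
theta = [0.35, 0.20, 0.30, 0.15]
n = 3

# Distribution of y = T x obtained by summing Multinomial probabilities.
aggregated = {}
for x in product(range(n + 1), repeat=4):
    if sum(x) == n:
        y = tuple(apply(T, x))
        aggregated[y] = aggregated.get(y, 0.0) + multinomial_pmf(x, theta)

# Compare with Mn_s(n, T theta).
T_theta = apply(T, theta)
max_gap = max(abs(p - multinomial_pmf(y, T_theta)) for y, p in aggregated.items())
```

`max_gap` should be at the level of rounding error, which is what the theorem predicts.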



Theorem 6: Multinomial Conditioning on the Partial Sum.

If $x \sim \mathrm{Mn}_m(n, \theta)$, then the distribution of part of the vector $x$ conditioned on its sum has Multinomial distribution, having as parameter the corresponding part of the original (normalized) parameters. In more detail, conditioning on the $t$ first components, we have:

$$ x_{1:t} \,|\, (1'x_{1:t} = j) \sim \mathrm{Mn}_t\!\left(j, \frac{1}{1'\theta_{1:t}}\,\theta_{1:t}\right) \quad \text{where} \quad 0 \le j \le n . $$



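Theorem 7: Multinomial-Binomial Decomposition.
Using the last two theorems, if $x \sim \mathrm{Mn}_m(n, \theta)$,

$$ \Pr(x \,|\, n, \theta) = \sum_{j=0}^{n} \Pr\!\left(x_{1:t} \,\Big|\, j, \frac{1}{1'\theta_{1:t}}\,\theta_{1:t}\right) \Pr\!\left(x_{t+1:m} \,\Big|\, (n-j), \frac{1}{1'\theta_{t+1:m}}\,\theta_{t+1:m}\right) \Pr\!\left(\begin{bmatrix} j \\ n-j \end{bmatrix} \,\Big|\, n, \begin{bmatrix} 1'\theta_{1:t} \\ 1'\theta_{t+1:m} \end{bmatrix}\right) , $$

where only the term with $j = 1'x_{1:t}$ is non-zero.

Analogously, we could write the Multinomial-Trinomial decomposition for a three-partition of the class indices in three super-classes. More generally, we could also write the m-nomial-s-nomial decomposition for the partition of the $m$ class indices into $s$ super-classes.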

B.4 Multivariate Hypergeometric Distribution

In the first section we have shown how a Hypergeometric variate can be generated from a Bernoulli process. The natural generalization of this result is obtained considering a Multinomial process. As in the last section, we say that $u^i$ is of class $k$, $c(u^i) = k$, iff $u^i = I^k$.

We take a sample of size $n$ from a finite population of size $N$ $(> n)$, that is partitioned into $m$ classes. The population frequencies (number of elements in each category) are represented by $\psi = [\psi_1, \ldots, \psi_m]'$, hence $N = 1'\psi$. Based on the sample, we want to make an inference on $\psi$. $x_k$ is the sample frequency of class $k$.

One way of describing this problem is to consider an urn with $N$ balls of $m$ different colors, indexed by $1, \ldots, m$. $\psi_k$ is the number of balls of color $k$. Assume that the $N$ balls are separated into two smaller boxes, so that box 1 has $n$ balls and box 2 has the remaining $N - n$ balls. The statistician can observe the composition of box 1, represented by the vector $x$ of sample frequencies. The quantity of interest for the statistician is the vector $\psi - x$ representing the composition of box 2.

As in the bivariate case, we assume that $U^{1:N}$ is a finite sub-sequence in an exchangeable process and, therefore, any sub-sequence of length $n$ extracted from $U^{1:N}$ has the same distribution as $U^{1:n}$. Hence, $x = U^{1:n} 1$ has the same distribution as the frequency vector for a sample of size $n$.

As in the bivariate case, our objective is to find the distribution of $x \,|\, \psi$. Again, using de Finetti theorem, there is a vector $0 \le \theta \le 1$, $1'\theta = 1$, such that $\amalg_{j=1}^{N} u^j \,|\, \theta$ and $\Pr(c(u^j) = k) = \theta_k$.

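Theorem 8: As in the Multinomial case, the following results follow:

• $\psi \,|\, \theta \sim \mathrm{Mn}_m(N, \theta)$ ;

• $x \,|\, \theta \sim \mathrm{Mn}_m(n, \theta)$ ;

• $(\psi - x) \,|\, \theta \sim \mathrm{Mn}_m((N - n), \theta)$ ;

• $(\psi - x) \amalg x \,|\, \theta$ .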

Using the results of the last section and following the same steps as in the bivariate case in the first section, we obtain the following expression for the $m$-variate Hypergeometric distribution, $x^n \,|\, [n, N, \psi] \sim \mathrm{Hy}_m(n, N, \psi)$ :

$$ \Pr(x^n \,|\, n, N, \psi) = \frac{\binom{n}{x^n} \binom{N-n}{\psi - x^n}}{\binom{N}{\psi}} $$

where $0 \le x^n \le \psi \le N 1$ , $1'x^n = n$ , $1'\psi = N$ .
This is the vector representation of the Hypergeometric probability distribution,

$$ x^n \,|\, [n, x^N] \sim \mathrm{Hy}(n, N, x^N) . $$

Alternatively, we can write the more usual formula,

$$ \Pr(x \,|\, \psi) = \frac{\prod_{k=1}^{m} \binom{\psi_k}{x_k}}{\binom{N}{n}} . $$
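The two expressions for the Hypergeometric probability, one built from multinomial coefficients and one from a product of binomial coefficients, can be compared directly. This is a minimal sketch; the function names and the urn composition `psi` are hypothetical.

```python
from math import comb, factorial

def multinomial_coef(n, x):
    """n! / (x_1! ... x_m!), assuming sum(x) == n."""
    c = factorial(n)
    for xk in x:
        c //= factorial(xk)
    return c

def hypergeometric_pmf_vector(x, psi):
    """Vector form: multinomial coefficients of x, psi - x and psi."""
    n, N = sum(x), sum(psi)
    rest = [pk - xk for pk, xk in zip(psi, x)]
    if min(rest) < 0:
        return 0.0   # cannot draw more balls of a color than the urn holds
    return multinomial_coef(n, x) * multinomial_coef(N - n, rest) / multinomial_coef(N, psi)

def hypergeometric_pmf_usual(x, psi):
    """Usual form: product of binomial coefficients over comb(N, n)."""
    num = 1
    for pk, xk in zip(psi, x):
        num *= comb(pk, xk)   # comb returns 0 when xk > pk
    return num / comb(sum(psi), sum(x))

psi = [5, 3, 2]   # hypothetical urn: N = 10 balls of m = 3 colors
x = [2, 1, 1]     # hypothetical sample of size n = 4
p_vector = hypergeometric_pmf_vector(x, psi)
p_usual = hypergeometric_pmf_usual(x, psi)
```

Both functions should return the same value for any admissible `x`; for the example above this is 60/210.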



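Theorem 9: The expectation and covariance of a random vector with Hypergeometric distribution, $x \sim \mathrm{Hy}_m(n, N, \psi)$, are:

$$ E(x) = n\bar\psi , \quad \mathrm{Cov}(x) = n\,\frac{N-n}{N-1}\left(\mathrm{diag}(\bar\psi) - \bar\psi \otimes \bar\psi'\right) \quad \text{where} \quad \bar\psi = \frac{1}{N}\,\psi . $$

Proof: Use that

$$ \mathrm{Cov}(x^n) = n\,\mathrm{Cov}(u^1) + n(n-1)\,\mathrm{Cov}(u^1, u^2) , $$

$$ \mathrm{Cov}(u^1) = E\left(u^1 \otimes (u^1)'\right) - E(u^1) \otimes E(u^1)' = \mathrm{diag}(\bar\psi) - \bar\psi \otimes \bar\psi' , $$

$$ \mathrm{Cov}(u^1, u^2) = E\left(u^1 \otimes (u^2)'\right) - E(u^1) \otimes E(u^2)' . $$

The second terms of the last two equations are equal, and the first term of the last equation is

$$ E(u^1_i u^2_j) = \begin{cases} \dfrac{\psi_i}{N}\,\dfrac{\psi_i - 1}{N-1} & \text{if } i = j , \\[2ex] \dfrac{\psi_i}{N}\,\dfrac{\psi_j}{N-1} & \text{if } i \ne j . \end{cases} $$

Algebraic manipulation yields the result.

Note that, as in the order 2 case, the diagonal elements of $\mathrm{Cov}(u^1)$ are positive, while the diagonal elements of $\mathrm{Cov}(u^1, u^2)$ are negative. In the off diagonal elements, the signs are reversed.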
B.5 Dirichlet Distribution

In the second section we presented the multinomial distribution, $\mathrm{Mn}_m(n, \theta)$. In this section we present the Dirichlet distribution for the parameter $\theta$. Let us first recall the univariate Poisson and Gamma distributions.

A random variable has Gamma distribution, $x \,|\, [a, b] \sim \mathrm{G}(a, b)$, $a, b > 0$, if its distribution is continuous with density

$$ f(x \,|\, a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} \exp(-bx) , \quad x \ge 0 . $$

The expectation and variance of this variate are

$$ E(x) = \frac{a}{b} \quad \text{and} \quad \mathrm{Var}(x) = \frac{a}{b^2} . $$

Lemma 10: Reproductive property for the Gamma distribution.
If $n$ independent random variables $x_i \,|\, a_i, b \sim \mathrm{G}(a_i, b)$, then

$$ 1'x \sim \mathrm{G}(1'a, b) . $$

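Lemma 11: The Gamma distribution is conjugate to the Poisson distribution.
Proof:

If $y \,|\, \lambda \sim \mathrm{Ps}(\lambda)$ and $\lambda$ has prior $\lambda \,|\, a, b \sim \mathrm{G}(a, b)$, then

$$ f(\lambda \,|\, y, a, b) \propto L(\lambda \,|\, y)\, f(\lambda) = \exp(-\lambda)\,\frac{\lambda^y}{y!}\; \frac{b^a}{\Gamma(a)}\,\lambda^{a-1} \exp(-b\lambda) \propto \lambda^{y+a-1} \exp(-(b+1)\lambda) . $$

That is, the posterior distribution of $\lambda$ is Gamma with parameters $[a + y, b + 1]$.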
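Lemma 11 translates into a one-line update rule. The following is a minimal sketch; the function name `gamma_poisson_update`, the prior parameters and the observed counts are hypothetical.

```python
def gamma_poisson_update(a, b, y):
    """Posterior Gamma parameters after observing one Poisson count y."""
    return a + y, b + 1.0

# Hypothetical prior G(2, 1) updated with three independent Poisson counts.
a, b = 2.0, 1.0
for y in [3, 0, 4]:
    a, b = gamma_poisson_update(a, b, y)

# Posterior expectation of the Poisson rate, E = a / b.
posterior_mean = a / b
```

Applying the update once per observation adds the observed counts to the first parameter and the number of observations to the second one.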



Definition: Dirichlet distribution.
A random vector

$$ y \in S_{m-1} = \{ y \in R^m \,|\, 0 \le y \le 1 \wedge 1'y = 1 \} $$

has Dirichlet distribution of order $m$ with positive $a \in R^m$ if its density is

$$ \Pr(y \,|\, a) = \frac{y \triangle (a - 1)}{B(a)} . $$



Note that $S_{m-1}$, the $m - 1$ dimensional Simplex, is the region of $R^m$ subject to the "constraint" $1'y = 1$. Hence, a point in the Simplex has only $m - 1$ "degrees of freedom". In this sense we say that the Dirichlet distribution has a "singular" representation. It is possible to give a non-singular representation to the distribution of $[y_1, \ldots, y_{m-1}]'$, known as the Multivariate Beta distribution, but at the cost of obtaining a convoluted algebraic formulation that also loses the natural geometric interpretation of the singular form.

The normalization factor for the Dirichlet distribution is

$$ B(a) = \int_{y \in S_{m-1}} \left( y \triangle (a - 1) \right) dy . $$



Lemma 12: Beta function.

The normalization factor for the Dirichlet distribution defined above is the Beta function, defined as

$$ B(a) = \frac{\prod_{k=1}^{m} \Gamma(a_k)}{\Gamma(1'a)} . $$

The proof is given at the end of this section.



Theorem 13: Dirichlet as Conjugate of the Multinomial:
If $\theta \sim \mathrm{Di}_m(a)$ and $x \,|\, \theta \sim \mathrm{Mn}_m(n, \theta)$ then

$$ \theta \,|\, x \sim \mathrm{Di}_m(a + x) . $$

Proof:

We only have to remember that the Multinomial likelihood is proportional to $\theta \triangle x$, and that a Dirichlet prior is proportional to $\theta \triangle (a - 1)$. Hence, the posterior is proportional to $\theta \triangle (x + a - 1)$. On the other hand, $B(a + x)$ is the normalization factor, i.e., equal to the integral on $\theta$ of $\theta \triangle (x + a - 1)$, and so we have a Dirichlet density function, as defined above.
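In computational terms, Theorem 13 says that Bayesian updating only requires adding the observed counts to the prior parameter. This is a minimal sketch; the function names and the example prior and counts are hypothetical, and the posterior mean uses the standard Dirichlet expectation $a / 1'a$.

```python
def dirichlet_posterior(a, x):
    """Parameter of the Dirichlet posterior given Multinomial counts x."""
    return [ak + xk for ak, xk in zip(a, x)]

def dirichlet_mean(a):
    """Expectation of a Dirichlet vector, a / 1'a."""
    total = sum(a)
    return [ak / total for ak in a]

prior = [1.0, 1.0, 1.0]   # hypothetical uniform prior on the simplex
counts = [3, 5, 2]        # hypothetical Multinomial sample, n = 10
posterior = dirichlet_posterior(prior, counts)
posterior_mean = dirichlet_mean(posterior)
```

With a uniform prior and counts `[3, 5, 2]` the posterior parameter is `[4, 6, 3]`, so the posterior mean is `[4/13, 6/13, 3/13]`.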



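Lemma 14: Dirichlet Moments.
If $\theta \sim \mathrm{Di}_m(a)$ and $p \in N^m$, then

$$ E(\theta \triangle p) = \frac{B(a + p)}{B(a)} . $$

Proof:

$$ \int_{\Theta} (\theta \triangle p)\, f(\theta \,|\, a)\, d\theta = \frac{1}{B(a)} \int_{\Theta} (\theta \triangle p)(\theta \triangle (a - 1))\, d\theta = \frac{1}{B(a)} \int_{\Theta} (\theta \triangle (a + p - 1))\, d\theta = \frac{B(a + p)}{B(a)} . $$

Choosing the exponents, $p$, appropriately, we have
Corollary 15: If $\theta \sim \mathrm{Di}_m(a)$, then

$$ E(\theta) = \bar a = \frac{1}{1'a}\, a , $$

$$ \mathrm{Cov}(\theta) = \frac{1}{1'a + 1} \left( \mathrm{diag}(\bar a) - \bar a \otimes \bar a' \right) . $$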
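The moment formula can be evaluated through log-Gamma functions and compared with the closed forms for the mean and covariance. This is a minimal sketch, assuming the Beta function of Lemma 12 and the moment identity $E(\theta \triangle p) = B(a+p)/B(a)$; the names `log_beta`, `dirichlet_moment` and the parameter `a` are hypothetical.

```python
from math import exp, lgamma

def log_beta(a):
    """log B(a) = sum_k log Gamma(a_k) - log Gamma(1'a)."""
    return sum(lgamma(ak) for ak in a) - lgamma(sum(a))

def dirichlet_moment(a, p):
    """E(theta_1^p_1 ... theta_m^p_m) = B(a + p) / B(a)."""
    shifted = [ak + pk for ak, pk in zip(a, p)]
    return exp(log_beta(shifted) - log_beta(a))

a = [2.0, 3.0, 5.0]   # hypothetical Dirichlet parameter, 1'a = 10
A = sum(a)
mean_1 = dirichlet_moment(a, [1, 0, 0])
mean_2 = dirichlet_moment(a, [0, 1, 0])
var_1 = dirichlet_moment(a, [2, 0, 0]) - mean_1 ** 2
cov_12 = dirichlet_moment(a, [1, 1, 0]) - mean_1 * mean_2

# Closed forms: mean a_k / A, covariance (diag(abar) - abar abar') / (A + 1).
abar = [ak / A for ak in a]
var_1_formula = abar[0] * (1.0 - abar[0]) / (A + 1.0)
cov_12_formula = -abar[0] * abar[1] / (A + 1.0)
```

Choosing `p` as a unit vector gives the mean, and choosing two units (on the same or on different coordinates) gives the second moments needed for the covariance.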

Theorem 16: Characterization of the Dirichlet by the Gamma:

Let the components of the random vector $x \in R^m_+$ be independent variables with distribution $\mathrm{G}(a_k, b)$. Then, the normalized vector

$$ y = \frac{1}{1'x}\, x \sim \mathrm{Di}_m(a) , \quad 1'x \sim \mathrm{G}(1'a, b) \quad \text{and} \quad y \amalg 1'x . $$



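Proof:

Consider the normalization,

$$ y = \frac{1}{t}\, x , \quad t = 1'x , \quad x = t\, y , $$

as a transformation of variables. Note that one of the new variables, say $y_m$, becomes redundant, since $x_m = t\,(1 - y_1 \ldots - y_{m-1})$.

The Jacobian matrix of this transformation is

$$ J = \frac{\partial (x_1, x_2, \ldots, x_{m-1}, x_m)}{\partial (y_1, y_2, \ldots, y_{m-1}, t)} = \begin{bmatrix} t & 0 & \cdots & 0 & y_1 \\ 0 & t & \cdots & 0 & y_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & t & y_{m-1} \\ -t & -t & \cdots & -t & 1 - y_1 \ldots - y_{m-1} \end{bmatrix} . $$

By elementary operations (see appendix F) that add all rows to the last one, we obtain the LU factorization of the Jacobian matrix, $J = LU$, where

$$ L = \begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ -1 & -1 & \cdots & -1 & 1 \end{bmatrix} \quad \text{and} \quad U = \begin{bmatrix} t & 0 & \cdots & 0 & y_1 \\ 0 & t & \cdots & 0 & y_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & t & y_{m-1} \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix} . $$

A triangular matrix determinant is equal to the product of the elements in its main diagonal, hence $|J| = |L|\,|U| = 1 \cdot t^{m-1}$.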
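On the other hand, the joint distribution of $x$ is

$$ f(x) = \prod_{k=1}^{m} \mathrm{G}(x_k \,|\, a_k, b) = \prod_{k=1}^{m} \frac{b^{a_k}}{\Gamma(a_k)}\, e^{-b x_k} (x_k)^{a_k - 1} , $$

and the joint distribution in the new system of coordinates is

$$ g([y, t]) = |J|\, f\!\left(x^{-1}([y, t])\right) = t^{m-1} \prod_{k=1}^{m} \frac{b^{a_k}}{\Gamma(a_k)}\, e^{-b t y_k} (t y_k)^{a_k - 1} = \left( \prod_{k=1}^{m} \frac{b^{a_k}}{\Gamma(a_k)}\, (y_k)^{a_k - 1} \right) e^{-bt}\, t^{1'a - 1} . $$

Hence, the marginal distribution of $y = [y_1, \ldots, y_m]'$ is

$$ g(y) = \int_{0}^{\infty} g([y, t])\, dt = \left( \prod_{k=1}^{m} \frac{b^{a_k}}{\Gamma(a_k)}\, (y_k)^{a_k - 1} \right) \int_{0}^{\infty} e^{-bt}\, t^{1'a - 1}\, dt = \frac{\Gamma(1'a)}{\prod_{k=1}^{m} \Gamma(a_k)} \prod_{k=1}^{m} (y_k)^{a_k - 1} = \frac{y \triangle (a - 1)}{B(a)} . $$

In the last passage, we have replaced the integral by the normalization factor of a Gamma density, $\mathrm{G}(1'a, b)$. Hence, we obtain a density proportional to $y \triangle (a - 1)$, i.e., a Dirichlet, Q.E.D.

In the last passage we also obtain the Dirichlet normalization factor, proving the Beta function lemma.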
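The characterization by the Gamma also gives a simple way to simulate Dirichlet vectors: draw independent Gamma variates with a common rate and normalize them. The sketch below uses only the Python standard library; note that `random.gammavariate(alpha, beta)` takes `beta` as a scale parameter, so the rate $b$ used in this section enters as `1 / b`. The function name, the seed and the parameter values are hypothetical.

```python
import random

def sample_dirichlet(a, rng, b=1.0):
    """One Dirichlet draw obtained by normalizing independent G(a_k, b) variates."""
    # gammavariate uses a scale parameter, hence scale = 1 / rate
    x = [rng.gammavariate(ak, 1.0 / b) for ak in a]
    t = sum(x)
    return [xk / t for xk in x]

rng = random.Random(0)          # fixed seed, for reproducibility only
a = [2.0, 3.0, 5.0]             # hypothetical Dirichlet parameter
draws = [sample_dirichlet(a, rng) for _ in range(20000)]

# Monte Carlo estimate of E(y), to be compared with a / 1'a = [0.2, 0.3, 0.5].
empirical_mean = [sum(d[k] for d in draws) / len(draws) for k in range(len(a))]
```

The common rate `b` cancels in the normalization, which is why the resulting Dirichlet distribution does not depend on it.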



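Lemma 17: Bipartition of Indices for the Dirichlet.

Let $1{:}t$, $t+1{:}m$ be a bipartition of the class index domain, $1{:}m$, of an order $m$ Dirichlet, in two super-classes. Let $y \sim \mathrm{Di}_m(a)$, and

$$ z^1 = \frac{1}{1'y_{1:t}}\, y_{1:t} , \quad z^2 = \frac{1}{1'y_{t+1:m}}\, y_{t+1:m} , \quad w = \begin{bmatrix} 1'y_{1:t} \\ 1'y_{t+1:m} \end{bmatrix} . $$

We then have $z^1 \amalg z^2 \amalg w$ and

$$ z^1 \sim \mathrm{Di}_t(a_{1:t}) , \quad z^2 \sim \mathrm{Di}_{m-t}(a_{t+1:m}) \quad \text{and} \quad w \sim \mathrm{Di}_2\!\left( \begin{bmatrix} 1'a_{1:t} \\ 1'a_{t+1:m} \end{bmatrix} \right) . $$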
Proof:

From the Dirichlet characterization by the Gamma we can imagine that the vector $y$ is built by normalizing a vector $x$, as follows,

$$ y = \frac{1}{1'x}\, x , \quad x_k \sim \mathrm{G}(a_k, b) , \quad \amalg_{k=1}^{m}\, x_k . $$

Considering in isolation each one of the super-classes, we build the vectors $z^1$ and $z^2$ that are distributed as

$$ z^1 = \frac{1}{1'y_{1:t}}\, y_{1:t} = \frac{1}{1'x_{1:t}}\, x_{1:t} \sim \mathrm{Di}_t(a_{1:t}) , $$

$$ z^2 = \frac{1}{1'y_{t+1:m}}\, y_{t+1:m} = \frac{1}{1'x_{t+1:m}}\, x_{t+1:m} \sim \mathrm{Di}_{m-t}(a_{t+1:m}) , $$

$z^1 \amalg z^2$, that are in turn independent of the partial sums

$$ 1'x_{1:t} \sim \mathrm{G}(1'a_{1:t}, b) \quad \text{and} \quad 1'x_{t+1:m} \sim \mathrm{G}(1'a_{t+1:m}, b) . $$

Using again the theorem characterizing the Dirichlet by the Gamma distribution for these two Gamma variates, we obtain the result, Q.E.D.

We can generalize this result for any partition of the set of classes, as follows. If $y \sim \mathrm{Di}_m(a)$ and $T$ is an $s$-partition of the $m$ classes, the intra and extra super-class distributions are independent Dirichlets, with the vector of super-class totals distributed as

$$ w = T y \sim \mathrm{Di}_s(T a) . $$



B.6 Dirichlet-Multinomial

We say that a random vector $x \in N^m \,|\, 1'x = n$ has Dirichlet-Multinomial (DM) distribution with parameters $n$ and $a \in R^m_+$, iff

$$ \Pr(x \,|\, n, a) = \frac{B(a + x)}{B(a)} \binom{n}{x} . $$
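The DM probability can be computed stably on the log scale. This is a minimal sketch, assuming the form $\Pr(x \,|\, n, a) = \binom{n}{x} B(a+x)/B(a)$ with $B$ the Beta function of Lemma 12; the function names and the parameter values are hypothetical.

```python
from itertools import product
from math import exp, lgamma

def log_beta(a):
    """log B(a) = sum_k log Gamma(a_k) - log Gamma(1'a)."""
    return sum(lgamma(ak) for ak in a) - lgamma(sum(a))

def dm_pmf(x, a):
    """Dirichlet-Multinomial probability of the count vector x, with n = sum(x)."""
    n = sum(x)
    log_coef = lgamma(n + 1) - sum(lgamma(xk + 1) for xk in x)
    shifted = [ak + xk for ak, xk in zip(a, x)]
    return exp(log_coef + log_beta(shifted) - log_beta(a))

a = [1.5, 2.0, 0.5]   # hypothetical DM parameter
n = 5
support = [x for x in product(range(n + 1), repeat=len(a)) if sum(x) == n]
total = sum(dm_pmf(x, a) for x in support)

# With a = [1, 1] the DM of order 2 is uniform on {0, ..., n}.
uniform_case = dm_pmf([2, 3], [1.0, 1.0])
```

The probabilities over all count vectors with $1'x = n$ should add up to one, and the order 2 case with unit parameters should give $1/(n+1)$.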



Theorem 18: Characterization of the DM as a Dirichlet mixture of Multinomials.

If $\theta \sim \mathrm{Di}_m(a)$ and $x \,|\, \theta \sim \mathrm{Mn}_m(n, \theta)$ then $x \,|\, [n, a] \sim \mathrm{DM}_m(n, a)$ .

Proof:
The joint distribution of $\theta, x$ is proportional to $\theta \triangle (a + x - 1)$, which integrated on $\theta$ is $B(a + x)$. Hence, multiplying by the joint distribution constants, we have the marginal for $x$, Q.E.D. Therefore, we have also proved that the function DM is normalized, that is

$$ \Pr(x) = \int_{\theta \in S_{m-1}} \binom{n}{x} (\theta \triangle x)\, \frac{1}{B(a)}\, (\theta \triangle (a - 1))\, d\theta = \frac{1}{B(a)} \binom{n}{x} \int_{\theta \in S_{m-1}} (\theta \triangle (a + x - 1))\, d\theta = \frac{B(x + a)}{B(a)} \binom{n}{x} . $$

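Theorem 19: Characterization of the DM by m Negative Binomials.

Let $a \in N^m_+$, and $x \in N^m$ be a vector whose components are independent random variables, $x_k \sim \mathrm{NB}(a_k, \theta)$. Then

$$ x \,|\, [1'x = n, a] \sim \mathrm{DM}_m(n, a) . $$

Proof:

$$ \Pr(x \,|\, \theta, a) = \prod_{k=1}^{m} \binom{a_k + x_k - 1}{x_k} \theta^{a_k} (1 - \theta)^{x_k} . $$

Then,

$$ \Pr(x \,|\, 1'x = n, \theta, a) = \frac{\Pr(x \,|\, \theta, a)}{\Pr(1'x = n \,|\, \theta, a)} = \frac{\prod_{k=1}^{m} \binom{a_k + x_k - 1}{x_k}}{\binom{1'a + 1'x - 1}{1'x}} . $$

Hence,

$$ \Pr(x \,|\, 1'x = n, \theta, a) = \Pr(x \,|\, 1'x = n, a) = \prod_{k=1}^{m} \frac{\Gamma(a_k + x_k)}{x_k!\, \Gamma(a_k)} \Big/ \frac{\Gamma(1'a + n)}{\Gamma(1'a)\, n!} = \frac{B(a + x)}{B(a)} \binom{n}{x} . $$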



Theorem 20: The DM as Pseudo-Conjugate for the Hypergeometric

If $x \,|\, \psi \sim \mathrm{Hy}_m(n, N, \psi)$ and $\psi \sim \mathrm{DM}_m(N, a)$ then $(\psi - x) \,|\, x \sim \mathrm{DM}_m(N - n, a + x)$ .

Proof: Using the properties of the Hypergeometric already presented, we have the independence relation, $(\psi - x) \amalg x \,|\, \theta$. We can therefore use the Multinomial sample $x \,|\, \theta$ for updating the prior and obtain the posterior

$$ \theta \,|\, x \sim \mathrm{Di}_m(a + x) . $$

Hence, the distribution of the non sampled part of the population, $\psi - x$, given the sample $x$, is a mixture of $(\psi - x) \,|\, \theta$ by the posterior for $\theta$. By the characterization of the DM as a mixture of Multinomials by a Dirichlet, the theorem follows, i.e.,

$$ (\psi - x) \,|\, [\theta, x] \sim (\psi - x) \,|\, \theta \sim \mathrm{Mn}_m(N - n, \theta) \quad \text{and} \quad \theta \,|\, x \sim \mathrm{Di}_m(a + x) $$

$$ \Rightarrow \quad (\psi - x) \,|\, x \sim \mathrm{DM}_m(N - n, a + x) . $$

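Lemma 21: DM Expectation and Covariance.
If $x \sim \mathrm{DM}_m(n, a)$ then

$$ E(x) = n \bar a = \frac{n}{1'a}\, a , $$

$$ \mathrm{Cov}(x) = \frac{n (n + 1'a)}{1'a + 1} \left( \mathrm{diag}(\bar a) - \bar a \otimes \bar a' \right) . $$

Proof:

$$ E(x) = E_\theta(E_x(x \,|\, \theta)) = E_\theta(n \theta) = n \bar a $$

$$ \begin{aligned} E(x \otimes x') &= E_\theta(E_x(x \otimes x' \,|\, \theta)) \\ &= E_\theta\left( E(x \,|\, \theta) \otimes E(x \,|\, \theta)' + \mathrm{Cov}(x \,|\, \theta) \right) \\ &= E_\theta\left( n (\mathrm{diag}(\theta) - \theta \otimes \theta') + n^2\, \theta \otimes \theta' \right) \\ &= n E_\theta(\mathrm{diag}(\theta)) + n(n-1) E_\theta(\theta \otimes \theta') \\ &= n\, \mathrm{diag}(\bar a) + n(n-1) \left( E(\theta) \otimes E(\theta)' + \mathrm{Cov}(\theta) \right) \\ &= n\, \mathrm{diag}(\bar a) + n(n-1) \left( \bar a \otimes \bar a' + \frac{1}{1'a + 1} (\mathrm{diag}(\bar a) - \bar a \otimes \bar a') \right) \\ &= n\, \mathrm{diag}(\bar a) + n(n-1) \left( \frac{1}{1'a + 1}\, \mathrm{diag}(\bar a) + \frac{1'a}{1'a + 1}\, \bar a \otimes \bar a' \right) \end{aligned} $$

$$ \begin{aligned} \mathrm{Cov}(x) &= E(x \otimes x') - E(x) \otimes E(x)' = E(x \otimes x') - n^2\, \bar a \otimes \bar a' \\ &= \left( n + \frac{n(n-1)}{1'a + 1} \right) \mathrm{diag}(\bar a) + \left( n(n-1) \frac{1'a}{1'a + 1} - n^2 \right) \bar a \otimes \bar a' \\ &= \frac{n (n + 1'a)}{1'a + 1} \left( \mathrm{diag}(\bar a) - \bar a \otimes \bar a' \right) \quad \text{Q.E.D.} \end{aligned} $$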
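The moment formulas of Lemma 21 can be cross-checked by enumerating the DM distribution for a small case. This is a minimal sketch, assuming the DM probability $\binom{n}{x} B(a+x)/B(a)$ and the covariance factor $n(n + 1'a)/(1'a + 1)$; all names and numerical values are hypothetical.

```python
from itertools import product
from math import exp, lgamma

def log_beta(a):
    return sum(lgamma(ak) for ak in a) - lgamma(sum(a))

def dm_pmf(x, a):
    n = sum(x)
    log_coef = lgamma(n + 1) - sum(lgamma(xk + 1) for xk in x)
    return exp(log_coef + log_beta([ak + xk for ak, xk in zip(a, x)]) - log_beta(a))

a, n = [1.5, 2.0, 0.5], 5     # hypothetical parameters
m, A = len(a), sum(a)
support = [x for x in product(range(n + 1), repeat=m) if sum(x) == n]
probs = [dm_pmf(x, a) for x in support]

# Moments by direct enumeration.
mean = [sum(p * x[i] for p, x in zip(probs, support)) for i in range(m)]
cov = [[sum(p * (x[i] - mean[i]) * (x[j] - mean[j]) for p, x in zip(probs, support))
        for j in range(m)] for i in range(m)]

# Moments from the closed forms.
abar = [ak / A for ak in a]
factor = n * (n + A) / (A + 1.0)
mean_formula = [n * ab for ab in abar]
cov_formula = [[factor * ((abar[i] if i == j else 0.0) - abar[i] * abar[j])
                for j in range(m)] for i in range(m)]

max_gap = max(abs(cov[i][j] - cov_formula[i][j]) for i in range(m) for j in range(m))
```

Compared with the Multinomial covariance $n(\mathrm{diag}(\bar a) - \bar a \otimes \bar a')$, the DM covariance is inflated by the factor $(n + 1'a)/(1'a + 1)$, which reflects the extra uncertainty about $\theta$.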

Theorem 22: DM Class Bipartition

Let $1{:}t$, $t+1{:}m$ be a bipartition of the index domain for the classes of an order $m$ DM, $1{:}m$, in two super-classes. Then, the following conditions (i) to (iii) are equivalent to condition (iv):

i: $x_{1:t} \amalg x_{t+1:m} \,|\, n_1 = 1'x_{1:t}$ ;

ii-1: $x_{1:t} \,|\, n_1 = 1'x_{1:t} \sim \mathrm{DM}_t(n_1, a_{1:t})$ ;

ii-2: $x_{t+1:m} \,|\, n_2 = 1'x_{t+1:m} \sim \mathrm{DM}_{m-t}(n_2, a_{t+1:m})$ ;

iii: $\begin{bmatrix} n_1 \\ n_2 \end{bmatrix} \sim \mathrm{DM}_2\!\left( n, \begin{bmatrix} 1'a_{1:t} \\ 1'a_{t+1:m} \end{bmatrix} \right)$ ;

iv: $x \sim \mathrm{DM}_m(n, a)$ .

Proof: We only have to show that the joint distribution can be factored in this form. By the DM characterization as a mixture, we can write it as a Dirichlet mixture of Multinomials. By the bipartition theorems, we can factor both the Multinomials and the Dirichlet, so the theorem follows.

B.7 Dirichlet of the Second Kind

Consider $y \sim \mathrm{Di}_{m+1}(a)$. The vector $z = (1/y_{m+1})\, y_{1:m}$ has Dirichlet of the Second Kind (D2K) distribution.

Theorem 23: Characterization of D2K by the Gamma distribution.

Using the characterization of the Dirichlet by the Gamma, we can write the D2K variate as a function of $m + 1$ independent Gamma variates,

$$ z_{1:m} \sim (1/x_{m+1})\, x_{1:m} \quad \text{where} \quad x_k \sim \mathrm{G}(a_k, b) . $$

Similar to what we did for the Dirichlet (of the first kind), we can write the D2K distribution and its moments as:

$$ f(z \,|\, a) = \frac{z \triangle (a_{1:m} - 1)}{(1 + 1'z)^{1'a}\, B(a)} , $$

$$ E(z) = e = \frac{1}{a_{m+1} - 1}\, a_{1:m} , $$

$$ \mathrm{Cov}(z) = \frac{1}{a_{m+1} - 2} \left( \mathrm{diag}(e) + e \otimes e' \right) . $$

The logarithm of a Gamma variate is well approximated by a Normal variate, see Aitchison & Shen (1980). This approximation is the key to several efficient computational procedures, and motivates the computation of the first two moments of the log-D2K distribution. For that, we use the Digamma, $\psi(\,)$, and Trigamma function, $\psi'(\,)$, defined as:

$$ \psi(a) = \frac{d}{da} \ln \Gamma(a) = \frac{\Gamma'(a)}{\Gamma(a)} , \quad \psi'(a) = \frac{d}{da}\, \psi(a) . $$

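Lemma 24: The expectation and covariance of a log-D2K variate are:

$$ E(\log(z)) = \psi(a_{1:m}) - \psi(a_{m+1})\, 1 , $$

$$ \mathrm{Cov}(\log(z)) = \mathrm{diag}(\psi'(a_{1:m})) + \psi'(a_{m+1})\, 1 \otimes 1' . $$

Proof: Consider a Gamma variate, $x \sim \mathrm{G}(a, 1)$ :

$$ 1 = \int_{0}^{\infty} f(x)\, dx = \int_{0}^{\infty} \frac{1}{\Gamma(a)}\, x^{a-1} \exp(-x)\, dx . $$

Taking the derivative with respect to parameter $a$, we have

$$ 0 = \int_{0}^{\infty} \ln(x)\, x^{a-1}\, \frac{\exp(-x)}{\Gamma(a)}\, dx - \frac{\Gamma'(a)}{\Gamma^2(a)}\, \Gamma(a) = E(\ln(x)) - \psi(a) . $$

Taking the derivative with respect to parameter $a$ a second time,

$$ \psi'(a) = \frac{d}{da} \int_{0}^{\infty} \ln(x)\, x^{a-1}\, \frac{\exp(-x)}{\Gamma(a)}\, dx = \int_{0}^{\infty} \ln(x)^2\, x^{a-1}\, \frac{\exp(-x)}{\Gamma(a)}\, dx - \frac{\Gamma'(a)}{\Gamma(a)}\, E(\ln(x)) $$

$$ = E(\ln(x)^2) - E(\ln(x))^2 = \mathrm{Var}(\ln(x)) . $$

The lemma follows from the D2K characterization by the Gamma.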
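The two identities used in the proof, $E(\ln x) = \psi(a)$ and $\mathrm{Var}(\ln x) = \psi'(a)$ for $x \sim \mathrm{G}(a, 1)$, can be checked by simulation. The Python standard library has no Digamma function, so the sketch below approximates $\psi$ and $\psi'$ by finite differences of `math.lgamma`; this numerical shortcut, the step sizes, the seed and the value of `a` are assumptions made only for illustration.

```python
import random
from math import lgamma, log

def digamma(a, h=1e-5):
    """Central finite-difference approximation of d/da log Gamma(a)."""
    return (lgamma(a + h) - lgamma(a - h)) / (2.0 * h)

def trigamma(a, h=1e-4):
    """Second finite-difference approximation of d^2/da^2 log Gamma(a)."""
    return (lgamma(a + h) - 2.0 * lgamma(a) + lgamma(a - h)) / h ** 2

rng = random.Random(1)   # fixed seed, for reproducibility only
a = 3.0                  # hypothetical shape parameter

# Monte Carlo moments of the logarithm of a G(a, 1) variate.
logs = [log(rng.gammavariate(a, 1.0)) for _ in range(50000)]
mc_mean = sum(logs) / len(logs)
mc_var = sum((v - mc_mean) ** 2 for v in logs) / len(logs)

exact_mean = digamma(a)
exact_var = trigamma(a)
```

For serious work a library implementation of the Digamma and Trigamma functions would be preferable to finite differences; the sketch only aims at making the identities tangible.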

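B.8 Examples

Example 1: Let $A$, $B$ be two attributes, each one of them present or absent in the elements of a population. Then each element of this population can be classified in exactly one of $2^2 = 4$ categories

A        B        k   I^k
present  present  1   [1,0,0,0]'
present  absent   2   [0,1,0,0]'
absent   present  3   [0,0,1,0]'
absent   absent   4   [0,0,0,1]'

According to the notation above, we can write $x \,|\, n, \theta \sim \mathrm{Mn}_4(n, \theta)$.
If $\theta = [0.35, 0.20, 0.30, 0.15]$ and $n = 10$, then

$$ \Pr(x^{10} \,|\, n, \theta) = \binom{10}{x^{10}} (\theta \triangle x^{10}) . $$

Hence, in order to compute the probability of $x = [1, 2, 3, 4]'$ given $\theta$, we use the expression above, obtaining

$$ \Pr\!\left( \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \Big|\, \begin{bmatrix} 0.35 \\ 0.20 \\ 0.30 \\ 0.15 \end{bmatrix} \right) = \frac{10!}{1!\,2!\,3!\,4!}\; 0.35 \cdot 0.20^2 \cdot 0.30^3 \cdot 0.15^4 = 12600 \times 1.9136 \times 10^{-7} \approx 0.00241 . $$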
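Example 2: If $X \,|\, \theta \sim \mathrm{Mn}_3(10, \theta)$, $\theta = [0.20, 0.30, 0.15]$, one can conclude, using the result above, that

$$ E(X) = (2, 3, 1.5) , $$

while the covariance matrix is

$$ \begin{bmatrix} 1.6 & -0.6 & -0.3 \\ -0.6 & 2.1 & -0.45 \\ -0.3 & -0.45 & 1.28 \end{bmatrix} . $$

Here $\theta$ lists the probabilities of classes 1 to 3 only; the residual class 0 has probability $\theta_0 = 1 - 0.65 = 0.35$.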
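The numbers in this example follow from the Multinomial moment formulas $E(x) = n\theta$ and $\mathrm{Cov}(x) = n(\mathrm{diag}(\theta) - \theta \otimes \theta')$, applied to the three listed classes (whose probabilities add up to 0.65, the remaining mass belonging to a class that is not listed). The function name below is hypothetical.

```python
def multinomial_moments(n, theta):
    """Mean vector and covariance matrix n (diag(theta) - theta theta')."""
    m = len(theta)
    mean = [n * t for t in theta]
    cov = [[n * ((theta[i] if i == j else 0.0) - theta[i] * theta[j])
            for j in range(m)] for i in range(m)]
    return mean, cov

mean, cov = multinomial_moments(10, [0.20, 0.30, 0.15])
```

The last diagonal entry evaluates to 1.275, which the text rounds to 1.28.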



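Example 3: Assume that $X \,|\, \theta \sim \mathrm{Mn}_3(10, \theta)$, with $\theta = [0.20, 0.30, 0.15]$, as in Example 2. Let us take $A_0 = \{0, 1\}$, $A_1 = \{2, 3\}$. Then,

$$ \sum_{i \in A_1} X_i \,|\, \theta = X_2 + X_3 \,|\, \theta \sim \mathrm{Mn}_1(10, \theta_2 + \theta_3) , $$

or

$$ X_2 + X_3 \,|\, \theta \sim \mathrm{Mn}_1(10, 0.45) . $$

Analogously,

$$ X_0 + X_1 \,|\, \theta \sim \mathrm{Mn}_1(10, 0.55) , $$
$$ X_1 + X_3 \,|\, \theta \sim \mathrm{Mn}_1(10, 0.35) , $$
$$ X_2 \,|\, \theta \sim \mathrm{Mn}_1(10, 0.30) . $$

Note that, in general, if $X \,|\, \theta \sim \mathrm{Mn}_k(n, \theta)$ then $X_i \,|\, \theta \sim \mathrm{Mn}_1(n, \theta_i)$, $i = 1, \ldots, k$.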
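Example 4: 3 x 3 Contingency Tables.

Assume that $X \,|\, \theta \sim \mathrm{Mn}_8(n, \theta)$, as in a 3 x 3 contingency table:

X_11   X_12   X_13  |  X_1•
X_21   X_22   X_23  |  X_2•
X_31   X_32   X_33  |  X_3•
X_•1   X_•2   X_•3  |  n

Applying Theorem 5 we get

$$ (X_{1\bullet}, X_{2\bullet}) \,|\, \theta \sim \mathrm{Mn}_2(n, \theta') , \quad \theta' = (\theta_{1\bullet}, \theta_{2\bullet}) , \quad \theta'_0 = \theta_{3\bullet} . $$

This result also tells us that

$$ (X_{i1}, X_{i2}, X_{i3}) \,|\, \theta \sim \mathrm{Mn}_3(n, \theta'_i) , $$

with

$$ \theta'_i = (\theta_{i1}, \theta_{i2}, \theta_{i3}) , \quad \theta'_{0i} = 1 - \theta_{i\bullet} , \quad i = 1, 2, 3 . $$

We can now apply Theorem 6 to obtain the probability distribution of each row of the contingency table, conditioned on its sum, or conditioned on the sum of the other rows. We have

$$ (X_{i1}, X_{i2}) \,|\, X_{i\bullet}, \theta \sim \mathrm{Mn}_2(X_{i\bullet}, \theta'_i) , $$

with

$$ \theta'_i = \left( \frac{\theta_{i1}}{\theta_{i\bullet}}, \frac{\theta_{i2}}{\theta_{i\bullet}} \right) , \quad \theta'_{0i} = \frac{\theta_{i3}}{\theta_{i\bullet}} . $$

The next result expresses the distribution of $X \,|\, \theta$ in terms of the conditional distributions of each row of the table, given its sum, and in terms of the distribution of these sums.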



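Proposition 25: If $X \,|\, \theta \sim \mathrm{Mn}_{r^2 - 1}(n, \theta)$, as in an $r \times r$ contingency table, then $P(X \,|\, \theta)$ can be written as

$$ P(X \,|\, \theta) = \left[ \prod_{i=1}^{r} P(X_{i1}, \ldots, X_{i,r-1} \,|\, X_{i\bullet}, \theta) \right] P(X_{1\bullet}, \ldots, X_{r-1,\bullet} \,|\, \theta) . $$

Proof: We have:

$$ P(X \,|\, \theta) = n! \prod_{i=1}^{r} \prod_{j=1}^{r} \frac{\theta_{ij}^{X_{ij}}}{X_{ij}!} = \left[ \prod_{i=1}^{r} \frac{X_{i\bullet}!}{X_{i1}! \cdots X_{ir}!} \left( \frac{\theta_{i1}}{\theta_{i\bullet}} \right)^{X_{i1}} \cdots \left( \frac{\theta_{ir}}{\theta_{i\bullet}} \right)^{X_{ir}} \right] \frac{n!}{X_{1\bullet}! \cdots X_{r\bullet}!}\; \theta_{1\bullet}^{X_{1\bullet}} \cdots \theta_{r\bullet}^{X_{r\bullet}} . $$

From Theorems 5 and 6, as in the last example, we recognize each of the first $r$ factors above as the probabilities of each row in the table, conditioned on its sum, and recognize the last factor as the joint probability distribution of the sums of these $r$ rows.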



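Corollary 26: If $X \,|\, \theta \sim \mathrm{Mn}_{r^2 - 1}(n, \theta)$, as in Theorems 5 and 6, then

$$ P(X \,|\, X_{1\bullet}, \ldots, X_{r-1,\bullet}, \theta) = \prod_{i=1}^{r} P(X_{i1}, \ldots, X_{i,r-1} \,|\, X_{i\bullet}, \theta) $$

and, knowing $\theta$, $X_{1\bullet}, \ldots, X_{r-1,\bullet}$,

$$ (X_{11}, \ldots, X_{1,r-1}) \amalg \ldots \amalg (X_{r1}, \ldots, X_{r,r-1}) . $$

Proof: Since

$$ P(X \,|\, \theta) = P(X \,|\, X_{1\bullet}, \ldots, X_{r-1,\bullet}, \theta)\, P(X_{1\bullet}, X_{2\bullet}, \ldots, X_{r-1,\bullet} \,|\, \theta) , $$

from Theorems 5 and 6 we get the proposed equality.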



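The following result will be used next to express Theorem 7 as a canonical representation for $P(X \,|\, \theta)$.

Proposition 27: If $X \,|\, \theta \sim \mathrm{Mn}_{r^2 - 1}(n, \theta)$, as in the Proposition above, then a transformation

$$ T : (\theta_{11}, \ldots, \theta_{1r}, \ldots, \theta_{r1}, \ldots, \theta_{r,r-1}) \mapsto (\lambda_{11}, \ldots, \lambda_{1,r-1}, \ldots, \lambda_{r1}, \ldots, \lambda_{r,r-1}, \eta_1, \ldots, \eta_{r-1}) $$

given by

$$ \lambda_{ij} = \frac{\theta_{ij}}{\theta_{i\bullet}} , \quad i = 1, \ldots, r , \quad j = 1, \ldots, r-1 , $$

$$ \eta_1 = \theta_{1\bullet} , \quad \eta_2 = \theta_{2\bullet} , \quad \ldots , \quad \eta_{r-1} = \theta_{(r-1)\bullet} , $$

is an onto transformation defined in $\{ 0 < \theta_{11} + \ldots + \theta_{r,r-1} < 1 \; ; \; 0 < \theta_{ij} < 1 \}$ over the unitary cube of dimension $r^2 - 1$. Moreover, the Jacobian of this transformation, $T$, is

$$ J = \eta_1^{\,r-1}\, \eta_2^{\,r-1} \cdots \eta_{r-1}^{\,r-1}\, (1 - \eta_1 - \ldots - \eta_{r-1})^{r-1} . $$

The proof is left as an exercise.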
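Example 5: Let us examine the case of a 2 x 2 contingency table:

x_11   x_12        θ_11   θ_12
x_21   x_22        θ_21   θ_22
(total n)

In order to obtain the canonical representation of $P(X \,|\, \theta)$ we use the transformation $T$ in the case $r = 2$:

$$ \lambda_{11} = \frac{\theta_{11}}{\theta_{11} + \theta_{12}} , \quad \lambda_{21} = \frac{\theta_{21}}{\theta_{21} + \theta_{22}} , \quad \eta_1 = \theta_{11} + \theta_{12} , $$

hence,

$$ P(X \,|\, \theta) = \binom{x_{1\bullet}}{x_{11}} \lambda_{11}^{x_{11}} (1 - \lambda_{11})^{x_{12}} \binom{x_{2\bullet}}{x_{21}} \lambda_{21}^{x_{21}} (1 - \lambda_{21})^{x_{22}} \binom{n}{x_{1\bullet}} \eta_1^{x_{1\bullet}} (1 - \eta_1)^{x_{2\bullet}} , $$

$$ 0 < \lambda_{11} < 1 , \quad 0 < \lambda_{21} < 1 , \quad 0 < \eta_1 < 1 . $$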



B.9 Functional Characterizations 

The objective of this section is to derive the general form of a homogeneous Markov random process. Theorem 28, by Renyi and Aczel, states that such a process is described by a mixture of Poisson distributions. Our presentation follows Aczel (1966, sec. 2.1 and 2.3) and Janossy, Renyi and Aczel (1950). It follows from the characterization of the Multinomial by the Poisson distribution given in theorem 4 that the Renyi-Aczel characterization of a homogeneous and local time point process is analogous to the de Finetti characterization of an infinite exchangeable 0-1 process as a mixture of Bernoulli distributions, see for example Feller (1971, v.2, ch.VII, sec. 4).

Cauchy's Functional Equations 

Cauchy's additive functional equation has the form

$$ f(x + y) = f(x) + f(y) . $$

The following argument from Cauchy (1821) shows that a continuous solution of this functional equation must have the form

$$ f(x) = cx . $$

Repeating the sum of the same argument, $x$, $n$ times, we must have $f(nx) = n f(x)$. If $x = (m/n)\,t$, then $nx = mt$ and

$$ n f(x) = f(nx) = f(mt) = m f(t) , \quad \text{hence} \quad f\!\left( \frac{m}{n}\, t \right) = \frac{m}{n}\, f(t) , $$

taking $c = f(1)$, $t = 1$ and $x = m/n$, it follows that $f(x) = cx$, over the rationals, $x \in Q$. From the continuity condition for $f(x)$, the last result must also be valid over the reals, $x \in R$. Q.E.D.

Cauchy's multiplicative functional equation has the form

$$ f(x + y) = f(x) f(y) , \quad \forall x, y > 0 , \quad f(x) \ge 0 . $$

The trivial solution of this equation is $f(x) \equiv 0$. Assuming $f(x) > 0$, we take the logarithm, reducing the multiplicative equation to the additive equation,

$$ \ln f(x + y) = \ln f(x) + \ln f(y) , \quad \text{hence,} $$

$$ \ln f(x) = cx , \quad \text{or} \quad f(x) = \exp(cx) . $$

Homogeneous Discrete Markov Processes 

We seek the general form of a homogeneous discrete Markov process. Let $w_k(t)$, for $t \ge 0$, be the probability of occurrence of exactly $k$ events. Let us also assume the following hypotheses:

Time Locality: If $t_1 \le t_2 \le t_3 \le t_4$ then the number of events in $[t_1, t_2[$ is independent of the number of events in $[t_3, t_4[$.

Time Homogeneity: The distribution for the number of events occurring in $[t_1, t_2[$ depends only on the interval length, $t = t_2 - t_1$.

From time locality and homogeneity, we can decompose the occurrence of no (zero) events in $[0, t + u[$ as

$$ w_0(t + u) = w_0(t)\, w_0(u) . $$

Hence, $w_0(t)$ must obey Cauchy's functional equation, and

$$ w_0(t) = \exp(ct) = \exp(-\lambda t) . $$

Since $w_0(t)$ is a probability distribution, $w_0(t) \le 1$, and $\lambda > 0$.

Hence, $v(t) = 1 - w_0(t) = 1 - \exp(-\lambda t)$, the probability of one or more events occurring before $t > 0$, must be the familiar exponential distribution.

For $n \ge 1$ occurrences before $t + u$, the general decomposition relation is

$$ w_n(t + u) = \sum_{k=0}^{n} w_k(t)\, w_{n-k}(u) . $$
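As a quick sanity check, the familiar Poisson probabilities $w_k(t) = e^{-\lambda t} (\lambda t)^k / k!$ are one particular solution of this system of functional equations. The sketch below verifies the decomposition relation numerically; the rate and the interval lengths are hypothetical example values.

```python
from math import exp, factorial

def w(k, t, lam):
    """Poisson probability of exactly k events in an interval of length t."""
    return exp(-lam * t) * (lam * t) ** k / factorial(k)

lam, t, u = 0.8, 1.3, 2.1   # hypothetical rate and interval lengths

# Gap between w_n(t + u) and the sum over k of w_k(t) w_{n-k}(u), for small n.
gaps = []
for n in range(6):
    lhs = w(n, t + u, lam)
    rhs = sum(w(k, t, lam) * w(n - k, u, lam) for k in range(n + 1))
    gaps.append(abs(lhs - rhs))
```

All entries of `gaps` should be at the level of rounding error. The case $n = 0$ is the multiplicative Cauchy equation for $w_0$.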

Theorem 28: (Renyi-Aczel) The general (non trivial) solution of this system of functional equations has the form:

$$ w_k(t) = e^{-\lambda t} \sum_{<r,k>} \prod_{j=1}^{k} \frac{(c_j t)^{r_j}}{r_j!} , \quad \lambda = \sum_{j=1}^{\infty} c_j , $$

where the index set $<r, k, n>$ is defined as

$$ <r, k, n> \; = \{ r_1, r_2, \ldots, r_k \,|\, r_1 + 2 r_2 \ldots + k r_k = n \} , $$

and $<r, k>$ is a shorthand for $<r, k, k>$.

Proof. By induction: The theorem is true for $k = 0$. Let us assume, as induction hypothesis, that it is true for $k < n$. The last equation in the recursive system is

$$ w_n(t + u) = \sum_{k=0}^{n} w_k(t)\, w_{n-k}(u) = w_n(t)\, e^{-\lambda u} + w_n(u)\, e^{-\lambda t} + e^{-\lambda (t + u)} \sum_{k=1}^{n-1} \sum_{<r,k>} \sum_{<s,n-k>} \prod_{i=1}^{k} \frac{(c_i t)^{r_i}}{r_i!} \prod_{j=1}^{n-k} \frac{(c_j u)^{s_j}}{s_j!} . $$

Defining

$$ f_n(t) = e^{\lambda t}\, w_n(t) - \sum_{<r,n-1,n>} \prod_{j=1}^{n-1} \frac{(c_j t)^{r_j}}{r_j!} , $$

the recursive equation takes the form

$$ f_n(t + u) = f_n(t) + f_n(u) , $$

and can be solved as a general Cauchy's equation, that is,

$$ f_n(t) = c_n t . $$

From the last equation and the definition of $f_n(t)$, we get the expression of $w_n(t)$ as in theorem 28. The constant $\lambda$ is chosen so that the distribution is normalized.

The general solution given by theorem 28 represents a composition (mixture) of Poisson processes, where an event in the $j$-th process in the composition corresponds to the simultaneous occurrence of $j$ single events in the original homogeneous Markov process. If we impose the following rarity condition, the general solution is reduced to a mixture of ordinary Poisson processes.

Rarity Condition: The probability that an event occurs in a short time at least once is approximately equal to the probability that it occurs exactly once, that is, the probability of simultaneous occurrences is zero.
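The general solution can be explored numerically. The sketch below is written under the assumption that the solution has the compound form $w_k(t) = e^{-\lambda t} \sum \prod_j (c_j t)^{r_j} / r_j!$, the sum running over all non-negative integers $r_1, \ldots, r_k$ with $r_1 + 2 r_2 + \ldots + k r_k = k$, and with $\lambda$ equal to the sum of the $c_j$. The helper names and the intensities in `c` are hypothetical.

```python
from math import exp, factorial

def index_set(k, n):
    """All tuples (r_1, ..., r_k) of non-negative integers with r_1 + 2 r_2 + ... + k r_k = n."""
    def rec(j, remaining):
        if j == 0:
            if remaining == 0:
                yield ()
            return
        for rj in range(remaining // j + 1):
            for rest in rec(j - 1, remaining - j * rj):
                yield rest + (rj,)
    return list(rec(k, n))

def w(k, t, c):
    """Probability of exactly k single events before t, for batch intensities c_1, c_2, ..."""
    lam = sum(c)
    total = 0.0
    for r in index_set(k, k):
        term = 1.0
        for j, rj in enumerate(r, start=1):
            cj = c[j - 1] if j <= len(c) else 0.0   # intensities beyond len(c) are zero
            term *= (cj * t) ** rj / factorial(rj)
        total += term
    return exp(-lam * t) * total

c = [0.5, 0.2, 0.1]   # hypothetical intensities of single, double and triple events
t, u = 0.7, 1.1

# Decomposition relation w_n(t + u) = sum_k w_k(t) w_{n-k}(u).
gaps = [abs(w(n, t + u, c) - sum(w(k, t, c) * w(n - k, u, c) for k in range(n + 1)))
        for n in range(6)]

# Normalization (the tail beyond k = 25 is negligible for these values).
total_mass = sum(w(k, 1.0, c) for k in range(26))

# Under the rarity condition (only c_1 > 0) the ordinary Poisson law is recovered.
poisson_case = w(3, 2.0, [0.9])
```

Setting all intensities except `c[0]` to zero removes simultaneous occurrences, and the probabilities reduce to $e^{-\lambda t} (\lambda t)^k / k!$.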

B.10 Final Remarks

This work is in memory of Professor D. Basu, who was the supervisor of the first author's PhD dissertation, the starting point for the research in Bayesian analysis of categorical data presented here. A long list of papers follows Basu and Pereira (1982). We have chosen a few that we recommend for additional reading: Albert (1985), Gunel (1984), Irony, Pereira and Tiwari (2000), Paulino and Pereira (1992, 1995) and Walker (1996). To make the analysis more realistic, extensions and mixtures of Dirichlet also were considered. For instance see Albert and Gupta (1983), Carlson (1977), Dickey (1983), Dickey, Jiang and Kadane (1987), and Jiang, Kadane and Dickey (1992).

Usually the more complex distributions are used to realistically represent situations for which the strong properties of the Dirichlet seem not to be realistic. For instance, in a 2 x 2 contingency table, the first line being conditionally independent of the second line given the marginal seems to be unrealistic in some situations. Mixtures of Dirichlet in some cases take care of the situation, as shown by Albert and Gupta (1983).

The properties presented here are also important in non-parametric Bayesian statistics in order to understand the Dirichlet process for the competitive risk survival problem. See for instance Salinas-Torres, Pereira and Tiwari (1997, 2002). In order to be historically correct we cannot forget the important book of Wilks, published in 1962, where one can find the definition of the Dirichlet distribution.

The material presented in this essay adopts a singular representation for several dis- 
tributions, as in Pereira and Stern (2005). This representation is unusual in the statistical 
literature, but the singular representation makes it simpler to extend and generalize the 
results and greatly facilitates numerical and computational implementations. 

We end this essay presenting the Renyi-Aczel characterization of the Poisson mixture. This result can be interpreted as an alternative to the de Finetti characterization theorem introduced in de Finetti (1937). Using the characterization of binomial distributions by Poisson processes conditional arguments, as given by Theorem 4, and the Blackwell (minimal) sufficiency properties discussed in Basu and Pereira (1983), Section 9 leads in fact to a de Finetti characterization for Binomial distributions. Also, if one recalls the indifference principle (Mendel, 1989), the finite version of the de Finetti argument can simply be obtained. See also Irony and Pereira (1994) for the motivation of these arguments. The consideration of Section 9 could be viewed as a very simple formulation of the binomial distribution finite characterization.



Appendix C 
Model Miscellanea 



"Das Werdende, das ewig wirkt und lebt, 
Umfass euch mit der Liebe holden Schranken, 
Und was in schwankender Erscheinung schwebt, 
Befestiget mit dauernden Gedanken!" 

The becoming, which forever works and lives,
Holds you in love's gracious bonds, 
And what fluctuates in apparent oscillations. 
Fix it in place with enduring thoughts! 

Johann Wolfgang von Goethe (1749-1832), 
The Lord, in Faust, prologue in heaven. 

"Randomness and order do not contradict each 
other; more or less both may be true at once. 
The randomness controls the world and due to 
this in the world there is order and law, which 

can be expressed in measures of random events 
that follow the laws of probability theory. " 

Alfred Renyi (1921 - 1970). 

This appendix collects the material in some slide presentations on a miscellanea of statistical models used during the course to illustrate several aspects of the FBST use and implementation. This appendix is not intended to be a self sufficient reading material, but rather a guide for further study. Section 1, on contingency table models, is (I hope) fully supplemented by the material on the Multinomial-Dirichlet distribution presented in appendix B. These models are of great practical importance, and also relatively simple to implement and easy to interpret. These characteristics make them ideal for the several statistical "experiments" required in the homework. Section 2, on a Weibull model, should require only minor additional reading; for further details see Barlow and Proschan (1981) and Irony et al. (2002). This model highlights the importance of being able to incorporate expert opinion as prior information.

Sections 3 to 5, presenting several models based on the Normal-Wishart distribution, may require extensive additional readings. Some epistemological aspects of these models are discussed in chapters 4 and 5. The material in these sections is presented for completeness, but its reading is optional, and only recommended for those students with a degree in statistics or equivalent knowledge. Of course, it is also possible to combine Normal-Wishart and Multinomial-Dirichlet models, in the form of mixture models, see section 6 and Lauretto and Stern (2005). Section 7 presents an overview of the REAL classification tree algorithm; for further details see Lauretto et al. (1998).

C.1 Contingency Table Models

Homogeneity test in 2 x 2 contingency table

This model is useful in many applications, like comparison of two communities with relation to a disease incidence, consumer behavior, electoral preference, etc. Two samples are taken from two binomial populations, and the objective is to test whether the success ratios are equal. Let $x$ and $y$ be the number of successes of two independent binomial experiments of sample sizes $m$ and $n$, respectively. The posterior density for this multinomial model is

$$ f(\theta \,|\, x, y, n, m) \propto \theta_1^{x}\, \theta_2^{m-x}\, \theta_3^{y}\, \theta_4^{n-y} . $$

The parameter space and the null hypothesis set are:

$$ \Theta = \{ 0 \le \theta \le 1 \,|\, \theta_1 + \theta_2 = 1 \wedge \theta_3 + \theta_4 = 1 \} , $$

$$ \Theta_0 = \{ \theta \in \Theta \,|\, \theta_1 = \theta_3 \} . $$

The Bayes Factor considering a priori $\Pr\{H\} = \Pr\{\theta_1 = \theta_3\} = 0.5$ and uniform densities over $\Theta_0$ and $\Theta - \Theta_0$ is given in the equation below. See [?] and [?] for details and discussion about properties.

$$ BF = \frac{\binom{m}{x} \binom{n}{y}}{\binom{m+n}{x+y}}\; \frac{(m+1)(n+1)}{m+n+1} . $$
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Independence test in a 2 x 2 contingency table 

Suppose that a laboratory test is used to help in the diagnosis of a disease. It should
be interesting to check if the test results are really related to the health conditions of a
patient. A patient chosen from a clinic is classified as one of the four states of the set

$$ \{ (h,t) \mid h, t = 0 \text{ or } 1 \} $$

in such a way that h is the indicator of the occurrence or not of the disease and t is the
indicator for the laboratory test being positive or negative. For a sample of size n we
record $(x_{00}, x_{01}, x_{10}, x_{11})$, the vector whose components are the sample frequencies of each
of the possibilities of (t,h). The parameter space is the simplex

$$ \Theta = \{ (\theta_{00}, \theta_{01}, \theta_{10}, \theta_{11}) \mid \theta_{ij} \ge 0 \,\wedge\, \textstyle\sum_{i,j} \theta_{ij} = 1 \} $$

and the null hypothesis, h and t are independent, is defined by

$$ \Theta_0 = \{ \theta \in \Theta \mid \theta_{00} = \theta_{0\bullet}\, \theta_{\bullet 0} ,\ \theta_{0\bullet} = \theta_{00} + \theta_{01} ,\ \theta_{\bullet 0} = \theta_{00} + \theta_{10} \} . $$

The Bayes Factor for this case is discussed by [Iro 95] and has the following expression:

$$ BF = \frac{\binom{x_{0\bullet}}{x_{00}} \binom{x_{1\bullet}}{x_{11}}}{\binom{n}{x_{\bullet 0}}} \; \frac{(n+2) \{ (n+3) - (n+2) [ P(1-P) + Q(1-Q) ] \}}{4(n+1)} $$

where $x_{i\bullet} = x_{i0} + x_{i1}$, $x_{\bullet j} = x_{0j} + x_{1j}$, P = … and Q = … 
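The Bayes factor of the homogeneity test earlier in this section can be checked numerically. The following is a minimal sketch, assuming uniform priors as stated in the text; the function names are hypothetical and are not taken from the original. It compares the closed form with a direct ratio of marginal likelihoods, using the identity $\int_0^1 t^a (1-t)^b dt = a!\,b!/(a+b+1)!$.

```python
from math import comb, factorial


def bf_homogeneity(x: int, m: int, y: int, n: int) -> float:
    """Closed-form Bayes factor for H: theta_1 = theta_3 (uniform priors)."""
    return comb(m, x) * comb(n, y) / comb(m + n, x + y) * (m + 1) * (n + 1) / (m + n + 1)


def beta_integral(a: int, b: int) -> float:
    """Integral of t^a (1 - t)^b over [0, 1] for non-negative integers a, b."""
    return factorial(a) * factorial(b) / factorial(a + b + 1)


def bf_from_marginals(x: int, m: int, y: int, n: int) -> float:
    """Same Bayes factor, computed as a ratio of marginal likelihoods."""
    coef = comb(m, x) * comb(n, y)
    # Under H both samples share one success ratio.
    under_h = coef * beta_integral(x + y, m + n - x - y)
    # Under the alternative each sample has its own, independent, ratio.
    under_alt = coef * beta_integral(x, m - x) * beta_integral(y, n - y)
    return under_h / under_alt
```

For example, `bf_homogeneity(3, 10, 4, 12)` and `bf_from_marginals(3, 10, 4, 12)` should agree up to rounding error.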



C.2 Weibull Wearout Model 

We were faced with the problem of testing the wearout of a lot of used display panels.
A panel displays 12 to 18 characters. Each character is displayed as a 5 x 8 matrix of
pixels, and each pixel is made of 2 (RG) or 3 (RGB) individual color elements (like a light
emitting diode or gas plasma device). A panel fails when the first individual color element
fails. The construction characteristics of a display panel make the Weibull distribution
especially well suited to model its lifetime. The color elements are "burned in" at the
production process, so we assume they are not in the infant mortality region, i.e. we
assume the Weibull's shape parameter to be greater than one, with wearout or increasing
hazard rates.

The panels in question were purchased as used components, taken from surplus
machines. The dealer informed us that the machines had been operated for a given time, and also
informed us of the mean life of the panels at those machines. Only working panels were
acquired. The acquired panels were installed as components on machines of a different
type. The use intensity of the panels at each type of machine corresponds to a different
time scale, so mean lives are not directly comparable. The shape parameter, however, is
an intrinsic characteristic of the panel. The used time over mean life ratio, $\rho = \alpha/\mu$, is
dimensionless, and can therefore be used as an intrinsic measure of wearout. We have
recorded the time to failure, or times of withdrawal with no failure, of the panels at the
new machines, and want to use this data to corroborate (or not) the wearout information
provided by the surplus equipment dealer.

Weibull Distribution

The two parameter Weibull probability density, reliability (or survival probability) and
hazard functions, for a failure time $t \ge 0$, given the shape and characteristic life (or scale)
parameters, $\beta > 0$ and $\gamma > 0$, are:

$$ w(t \mid \beta,\gamma) = (\beta\, t^{\beta-1} / \gamma^{\beta}) \exp(-(t/\gamma)^{\beta}) $$
$$ r(t \mid \beta,\gamma) = \exp(-(t/\gamma)^{\beta}) $$
$$ z(t \mid \beta,\gamma) \equiv w(\ )/r(\ ) = \beta\, t^{\beta-1} / \gamma^{\beta} $$

By altering the parameter $\beta$, $w(t \mid \beta,\gamma)$ takes a variety of shapes, Dodson (1994).
Some values of the shape parameter are important special cases: for $\beta = 1$, W is the
exponential distribution; for $\beta = 2$, W is the Rayleigh distribution; for $\beta = 2.5$, W approximates
the lognormal distribution; for $\beta = 3.6$, W approximates the normal distribution; and for
$\beta = 5.0$, W approximates the peaked normal distribution. The flexibility of the Weibull
distribution makes it very useful for empirical modeling, especially in quality control and
reliability. The regions $\beta < 1$, $\beta = 1$, and $\beta > 1$ correspond to decreasing, constant and
increasing hazard rates. These three regions are also known as infant mortality,
memoryless, and wearout failures. $\gamma$ is approximately the 63rd percentile of the lifetime,
regardless of the shape parameter.

The Weibull also has important theoretical properties. If n i.i.d. random variables
have Weibull distribution, $X_i \sim w(t \mid \beta,\gamma)$, then the first failure is a Weibull variate with
characteristic life $\gamma/n^{1/\beta}$, i.e. $X_{[1,n]} \sim w(t \mid \beta, \gamma/n^{1/\beta})$. This kind of property allows a
characterization of the Weibull as a limiting life distribution in the context of extreme
value theory, Barlow and Proschan (1975).

The mean and variance of a Weibull variate are given by:

$$ \mu = \gamma\, \Gamma(1 + 1/\beta) $$
$$ \sigma^2 = \gamma^2 \left( \Gamma(1 + 2/\beta) - \Gamma^2(1 + 1/\beta) \right) $$

The affine transformation $t = t' + \alpha$ leads to the three parameter truncated Weibull
distribution. A location (or threshold) parameter, $\alpha > 0$, represents beginning observation
of a (truncated) Weibull variate at t = 0, after it has already survived the period $[-\alpha, 0[$.
The three parameter truncated Weibull is given by:

$$ w(t \mid \alpha,\beta,\gamma) = (\beta\, (t+\alpha)^{\beta-1} / \gamma^{\beta}) \exp(-((t+\alpha)/\gamma)^{\beta}) \,/\, r(\alpha \mid \beta,\gamma) $$
$$ r(t \mid \alpha,\beta,\gamma) = \exp(-((t+\alpha)/\gamma)^{\beta}) \,/\, r(\alpha \mid \beta,\gamma) $$
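The Weibull functions above translate almost line by line into code. The sketch below is only illustrative; the function names are hypothetical, and it assumes t > 0 and the parameter ranges stated in the text. It can be used to check special cases, such as the exponential case $\beta = 1$, where the mean equals $\gamma$ and truncation has no effect on the reliability (memoryless property).

```python
import math


def weibull_pdf(t, beta, gamma):
    """Density w(t | beta, gamma)."""
    return (beta * t ** (beta - 1) / gamma ** beta) * math.exp(-((t / gamma) ** beta))


def weibull_rel(t, beta, gamma):
    """Reliability (survival probability) r(t | beta, gamma)."""
    return math.exp(-((t / gamma) ** beta))


def weibull_hazard(t, beta, gamma):
    """Hazard z(t | beta, gamma) = w / r."""
    return beta * t ** (beta - 1) / gamma ** beta


def weibull_mean(beta, gamma):
    return gamma * math.gamma(1 + 1 / beta)


def weibull_var(beta, gamma):
    return gamma ** 2 * (math.gamma(1 + 2 / beta) - math.gamma(1 + 1 / beta) ** 2)


def truncated_pdf(t, alpha, beta, gamma):
    """Density of the remaining life after surviving a period of length alpha."""
    return weibull_pdf(t + alpha, beta, gamma) / weibull_rel(alpha, beta, gamma)


def truncated_rel(t, alpha, beta, gamma):
    """Reliability of the remaining life after surviving a period of length alpha."""
    return weibull_rel(t + alpha, beta, gamma) / weibull_rel(alpha, beta, gamma)
```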



Wearout Model 

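The problem described in the preceding sections can be tested using the FBST, with
parameter space, hypothesis and posterior joint density:

$$ \Theta = \{ (\alpha,\beta,\gamma) \in \, ]0,\infty[ \, \times \, [1,\infty[ \, \times \, [0,\infty[ \, \} $$
$$ \Theta_0 = \{ (\alpha,\beta,\gamma) \in \Theta \mid \alpha = \rho\, \mu(\beta,\gamma) \} $$
$$ f(\alpha,\beta,\gamma \mid D) \propto \prod_{i=1}^{n} w(t_i \mid \alpha,\beta,\gamma) \prod_{j=1}^{m} r(t_j \mid \alpha,\beta,\gamma) $$

where the data D are all the recorded failure times, $t_i > 0$, and the times of withdrawal
with no failure, $t_j > 0$.

At the optimization step it is better, for numerical stability, to maximize the
log-likelihood, fl( ). Given a sample with n recorded failures and m withdrawals,

$$ wl_i = \log(\beta) + (\beta - 1) \log(t_i + \alpha) - \beta \log(\gamma) - ((t_i + \alpha)/\gamma)^{\beta} + (\alpha/\gamma)^{\beta} $$
$$ rl_j = -((t_j + \alpha)/\gamma)^{\beta} + (\alpha/\gamma)^{\beta} $$
$$ fl = \sum_{i=1}^{n} wl_i + \sum_{j=1}^{m} rl_j $$

the hypothesis being represented by the constraint

$$ h(\alpha,\beta,\gamma) = \rho\, \gamma\, \Gamma(1 + 1/\beta) - \alpha = 0 . $$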

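The analytical expressions for the gradients of fl( ) and h( ), to be given to the optimizer,
are:

$$ dwl = [\ (\beta - 1)/(t + \alpha) - ((t + \alpha)/\gamma)^{\beta} \beta/(t + \alpha) + (\alpha/\gamma)^{\beta} \beta/\alpha \ , $$
$$ \quad 1/\beta + \log(t + \alpha) - \log(\gamma) - ((t + \alpha)/\gamma)^{\beta} \log((t + \alpha)/\gamma) + (\alpha/\gamma)^{\beta} \log(\alpha/\gamma) \ , $$
$$ \quad -\beta/\gamma + ((t + \alpha)/\gamma)^{\beta} \beta/\gamma - (\alpha/\gamma)^{\beta} \beta/\gamma \ ] $$

$$ drl = [\ -((t + \alpha)/\gamma)^{\beta} \beta/(t + \alpha) + (\alpha/\gamma)^{\beta} \beta/\alpha \ , $$
$$ \quad -((t + \alpha)/\gamma)^{\beta} \log((t + \alpha)/\gamma) + (\alpha/\gamma)^{\beta} \log(\alpha/\gamma) \ , $$
$$ \quad ((t + \alpha)/\gamma)^{\beta} \beta/\gamma - (\alpha/\gamma)^{\beta} \beta/\gamma \ ] $$

$$ dh = [\ -1 \ ,\ -\rho\, \gamma\, \psi(1 + 1/\beta)\, \Gamma(1 + 1/\beta)/\beta^2 \ ,\ \rho\, \Gamma(1 + 1/\beta) \ ] $$

where the components are the partial derivatives with respect to $\alpha$, $\beta$ and $\gamma$, and
$\psi = \Gamma'/\Gamma$ is the digamma function. For efficient algorithms for the gamma and digamma
functions, see Spanier and Oldham (1987).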
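Analytical gradients of this kind are easy to get wrong, so it is prudent to compare them with finite differences before passing them to an optimizer. The following is a minimal sketch, not the original implementation; all function names are hypothetical, and the parameter order is (alpha, beta, gamma) as in the text.

```python
import math


def wl(t, a, b, g):
    """Log-density term of one recorded failure at time t."""
    return (math.log(b) + (b - 1) * math.log(t + a) - b * math.log(g)
            - ((t + a) / g) ** b + (a / g) ** b)


def rl(t, a, b, g):
    """Log-reliability term of one withdrawal (no failure) at time t."""
    return -(((t + a) / g) ** b) + (a / g) ** b


def log_likelihood(failures, withdrawals, a, b, g):
    return sum(wl(t, a, b, g) for t in failures) + sum(rl(t, a, b, g) for t in withdrawals)


def drl(t, a, b, g):
    u, v = ((t + a) / g) ** b, (a / g) ** b
    return [-u * b / (t + a) + v * b / a,
            -u * math.log((t + a) / g) + v * math.log(a / g),
            u * b / g - v * b / g]


def dwl(t, a, b, g):
    d = drl(t, a, b, g)
    # wl = log(b) + (b - 1) log(t + a) - b log(g) + rl, so add those derivatives.
    return [d[0] + (b - 1) / (t + a),
            d[1] + 1 / b + math.log(t + a) - math.log(g),
            d[2] - b / g]


def gradient(failures, withdrawals, a, b, g):
    terms = [dwl(t, a, b, g) for t in failures] + [drl(t, a, b, g) for t in withdrawals]
    return [sum(term[i] for term in terms) for i in range(3)]


def numeric_gradient(f, theta, eps=1e-6):
    """Central finite differences, used only as a check."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(*up) - f(*down)) / (2 * eps))
    return grad
```

A typical check evaluates `gradient` and `numeric_gradient` at an interior point, for example (alpha, beta, gamma) = (0.5, 3.5, 2.0), and confirms that the two agree to several digits.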

In this model, some prior distribution of the shape parameter is needed to stabilize
the model. Knowing the color elements' lifetime to be approximately normal, we consider
$\beta \in [3.0, 4.0]$.



C.3 The Normal-Wishart Distribution

The matrix notation used in this section is defined in section F.1.

The Bayesian research group at IME-USP has developed several applications based on 
multidimensional normal models, including structure models, mixture models and factor 
analysis models. In this appendix we review the core theory of some of these models, since 
they are used in some of the illustrative examples in chapters 4 and 5. For implementation 
details, practical applications, case studies, and further comments, see Lauretto et al. 
(2003). 

The conjugate family of priors for multivariate normal distributions is the Normal-Wishart
family of distributions, DeGroot (1970). Consider the random matrix X with
elements $X_i^j$, $i = 1 \ldots k$, $j = 1 \ldots n$, $n > k$, where each column, $x^j$, contains a sample
vector from a k-multivariate normal distribution with parameters $\beta$ (mean vector) and V
(covariance matrix), or $R = V^{-1}$ (precision matrix).

Let $\bar{x}$ and W denote, respectively, the statistics:

$$ \bar{x} = \frac{1}{n} \sum_{j=1}^{n} x^j = \frac{1}{n} X \mathbf{1} $$
$$ W = \sum_{j=1}^{n} (x^j - \beta)(x^j - \beta)' = (X - \beta)(X - \beta)' $$

The random matrix W has Wishart distribution with n degrees of freedom and precision
matrix R. The Normal and Wishart pdfs have the expressions:

$$ f(\bar{x} \mid n,\beta,R) = \left( \frac{n}{2\pi} \right)^{k/2} |R|^{1/2} \exp\left( -\frac{n}{2} (\bar{x} - \beta)' R (\bar{x} - \beta) \right) $$
$$ f(W \mid n,\beta,R) = c\, |W|^{(n-k-1)/2} \exp\left( -\frac{1}{2} \mathrm{tr}(W R) \right) $$
$$ c^{-1} = |R|^{-n/2}\, 2^{nk/2}\, \pi^{k(k-1)/4} \prod_{j=1}^{k} \Gamma\left( \frac{n + 1 - j}{2} \right) $$

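Now consider the matrix X as above, with unknown mean $\beta$ and unknown precision
matrix R, and the statistic

$$ S = \sum_{j=1}^{n} (x^j - \bar{x})(x^j - \bar{x})' = (X - \bar{x})(X - \bar{x})' $$

Taking as prior distribution for the precision matrix R the Wishart distribution with
$\dot{a} > k - 1$ degrees of freedom and precision matrix $\dot{S}$ and, given R, taking as prior for $\beta$
a multivariate normal with mean $\dot{\beta}$ and precision $\dot{n} R$, i.e.

$$ p(\beta, R) = p(R)\, p(\beta \mid R) $$
$$ p(R) \propto |R|^{(\dot{a}-k-1)/2} \exp\left( -\frac{1}{2} \mathrm{tr}(R \dot{S}) \right) $$
$$ p(\beta \mid R) \propto |R|^{1/2} \exp\left( -\frac{\dot{n}}{2} (\beta - \dot{\beta})' R (\beta - \dot{\beta}) \right) $$

The posterior distribution for the parameters $\beta$ and R has the form:

$$ p_n(\beta, R \mid n, \bar{x}, S) = p_n(R \mid n, \bar{x}, S)\, p_n(\beta \mid R, n, \bar{x}, S) $$
$$ p_n(R \mid n, \bar{x}, S) \propto |R|^{(\dot{a}+n-k-1)/2} \exp\left( -\frac{1}{2} \mathrm{tr}(R \ddot{S}) \right) $$
$$ p_n(\beta \mid R, n, \bar{x}, S) \propto |R|^{1/2} \exp\left( -\frac{\ddot{n}}{2} (\beta - \ddot{\beta})' R (\beta - \ddot{\beta}) \right) $$
$$ \ddot{\beta} = (n \bar{x} + \dot{n} \dot{\beta}) / \ddot{n} , \quad \ddot{n} = n + \dot{n} $$
$$ \ddot{S} = S + \dot{S} + \frac{n \dot{n}}{\ddot{n}} (\dot{\beta} - \bar{x})(\dot{\beta} - \bar{x})' $$

Hence, the posterior distribution for R is a Wishart distribution with $\dot{a} + n$ degrees of
freedom and precision $\ddot{S}$, and the conditional distribution for $\beta$, given R, is k-Normal
with mean $\ddot{\beta}$ and precision $\ddot{n} R$. All covariance and precision matrices are supposed to be
positive definite, $n > k$, $\dot{a} > k - 1$, and $\dot{n} > 0$.

Non-informative improper priors are given by $\dot{n} = 0$, $\dot{\beta} = 0$, $\dot{a} = 0$, $\dot{S} = 0$, i.e. we take
a Wishart with 0 degrees of freedom as prior for R, and a constant prior for $\beta$, Box and
Tiao (1973), DeGroot (1970), Zellner (1971). Then, the posterior for R is a Wishart with
n degrees of freedom and precision S, and the posterior for $\beta$, given R, is k-Normal with
mean $\bar{x}$ and precision nR.

We can now write the simplified log-posterior kernels:

$$ fl(\beta, R \mid n, \bar{x}, S) = fl(R \mid n, \bar{x}, S) + fl(\beta \mid R, n, \bar{x}, S) $$
$$ fl(R \mid n, \bar{x}, S) = flr = \frac{\dot{a} + n - k - 1}{2} \log(|R|) - \frac{1}{2} \mathrm{tr}(R \ddot{S}) $$
$$ fl(\beta \mid R, n, \bar{x}, S) = flb = \frac{1}{2} \log(|R|) - \frac{\ddot{n}}{2} (\beta - \ddot{\beta})' R (\beta - \ddot{\beta}) $$

For the surprise kernel, relative to the uninformative prior, we only have to replace the
factor $(\dot{a} + n - k - 1)/2$ by $(\dot{a} + n)/2$.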
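The conjugate update of this section amounts to a few sums. The sketch below is a plain-Python illustration for small problems, not the implementation used by the authors; the function name, the argument names and the dictionary keys are hypothetical. Samples are passed as a list of n vectors of length k, and prior hyperparameters carry the suffix `_dot`.

```python
def normal_wishart_posterior(samples, n_dot, a_dot, beta_dot, S_dot):
    """Return posterior hyperparameters (double-dotted symbols in the text)."""
    n, k = len(samples), len(samples[0])
    # Sample mean and scatter matrix S about the sample mean.
    xbar = [sum(x[i] for x in samples) / n for i in range(k)]
    S = [[sum((x[i] - xbar[i]) * (x[j] - xbar[j]) for x in samples)
          for j in range(k)] for i in range(k)]
    n_ddot = n + n_dot
    beta_ddot = [(n * xbar[i] + n_dot * beta_dot[i]) / n_ddot for i in range(k)]
    weight = n * n_dot / n_ddot
    S_ddot = [[S[i][j] + S_dot[i][j]
               + weight * (beta_dot[i] - xbar[i]) * (beta_dot[j] - xbar[j])
               for j in range(k)] for i in range(k)]
    return {"n": n_ddot, "a": a_dot + n, "beta": beta_ddot, "S": S_ddot}
```

With the non-informative choice (all prior hyperparameters zero) the result reduces to the sample mean and the scatter matrix S, as stated in the text.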

C.4 Structural Models 

In this section we study the dose-equivalence hypothesis. 

The dose-equivalence hypothesis, H, asserts a proportional response of a pair of
response measurements to two different stimuli. The hypothesis also asserts proportional
standard deviations, and equivalent correlations for each response pair. The
proportionality coefficient, $\delta$, is interpreted as the second stimulus dose equivalent to one unit of the
first.

This can be seen as a simultaneous generalization of the linear mean structure, the
linear covariance structure, and the Behrens-Fisher problems. The test proved to be useful
when comparing levels of genetic expression, as well as to calibrate microarray equipment
at BIOINFO, the genetic research task force at University of São Paulo. The application
of the dose-equivalence model is similar to the much simpler bio-equivalence model used
in pharmacology, and closely related to several other classic covariance structure models
used in biology, psychology, and social sciences, as described in Anderson (1969), Bock and
Bargmann (1966), Jiang and Sarkar (1998, 1999, 2000a,b), Jöreskog (1970), and McDonald
(1962, 1974, 1975). We are not aware of any alternative test for the dose-equivalence
hypothesis published in the literature.

C.4.1 Mean and Covariance Structure



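As is usual in the covariance structure literature, we will write $V(\gamma) = \sum_h \gamma_h G\{h\}$, where
the matrices $G\{h\}$, $h = 1, \ldots, k(k+1)/2$, form a basis for the space of $k \times k$ symmetric
matrices; in our case, k = 4. The matrix notation is presented in Section F.1.

$$ V(\gamma) = \sum_{h=1}^{10} \gamma_h G\{h\} =
\begin{bmatrix}
\gamma_1 & \gamma_5 & \gamma_7 & \gamma_8 \\
\gamma_5 & \gamma_2 & \gamma_9 & \gamma_{10} \\
\gamma_7 & \gamma_9 & \gamma_3 & \gamma_6 \\
\gamma_8 & \gamma_{10} & \gamma_6 & \gamma_4
\end{bmatrix} $$

where

$$ G\{h\} =
\begin{bmatrix}
\delta_h^1 & \delta_h^5 & \delta_h^7 & \delta_h^8 \\
\delta_h^5 & \delta_h^2 & \delta_h^9 & \delta_h^{10} \\
\delta_h^7 & \delta_h^9 & \delta_h^3 & \delta_h^6 \\
\delta_h^8 & \delta_h^{10} & \delta_h^6 & \delta_h^4
\end{bmatrix} $$

and the Kronecker-delta is $\delta_h^j = 1$ if $h = j$ and $\delta_h^j = 0$ if $h \ne j$.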

The dose-equivalence hypothesis, H, asserts a proportional response of a pair of
response measurements to two different stimuli. Each pair of response measurements is
supposed to be a bivariate normal variate. H also asserts proportional standard deviations,
and equivalent correlations for each pair of response measurements. The proportionality
coefficient, $\delta$, is interpreted as the dose, calibration or proportionality coefficient.

In order to get simpler expressions for the log-likelihood, the constraints and their
gradients, we use in the numerical procedures an extended parameter space including the
coefficient $\delta$, and state the dose-equivalence optimization problem on the extended
15-dimensional space, with a 5-dimensional constraint:

$$ \Theta = \{ \theta = [\gamma', \beta', \delta]' \in R^{10+4+1} \mid V(\gamma) > 0 \} $$
$$ \Theta_0 = \{ \theta \in \Theta \mid h(\theta) = 0 \} $$
$$ h(\theta) =
\begin{bmatrix}
\delta^2 \gamma_1 - \gamma_3 \\
\delta^2 \gamma_2 - \gamma_4 \\
\delta^2 \gamma_5 - \gamma_6 \\
\delta \beta_1 - \beta_3 \\
\delta \beta_2 - \beta_4
\end{bmatrix} $$

In order to be able to compute some gradients needed in the next section, we recall
some matrix derivative identities, see Anderson (1969), Harville (1997), McDonald and
Swaminathan (1973), Rogers (1980). We use $V = V(\gamma)$, $R = V^{-1}$, and C for a constant
(symmetric) matrix.

$$ \frac{\partial V}{\partial \gamma_h} = G\{h\} , \quad \frac{\partial R}{\partial \gamma_h} = -R\, G\{h\}\, R , $$
$$ \frac{\partial\, \beta' C \beta}{\partial \beta} = 2 C \beta , \quad \frac{\partial \log(|V|)}{\partial \gamma_h} = \mathrm{tr}(R\, G\{h\}) . $$

We also define the auxiliary matrices:

$$ P\{h\} = R\, G\{h\} , \quad Q\{h\} = P\{h\}\, R . $$

C.4.2 Numerical Optimization 

To find $\theta^*$ we use an objective function, to be minimized on the extended parameter space,
given by a centralization term minus the log-posterior kernel,

$$ f(\theta \mid n, \bar{x}, S) = c\, n\, \mathrm{frob2}(V - C) - flr - flb $$
$$ \quad = c\, n\, \mathrm{frob2}(V - C) - \frac{\dot{a} + n - k}{2} \log(|R|) + \frac{1}{2} \mathrm{tr}(R \ddot{S}) + \frac{\ddot{n}}{2} (\beta - \ddot{\beta})' R (\beta - \ddot{\beta}) $$

Large enough centralization factors, c, times the squared Frobenius norm of (V − C), where
C are intermediate approximations of the constrained minimum, make the first points of
the optimization sequence remain in the neighborhood of the empirical covariance (the
initial C). As the optimization proceeds, we relax the centralization factor, i.e. make $c \to 0$,
and maximize the pure posterior function. This is a standard optimization procedure
following the regularization strategy of Proximal-Point algorithms, see Bertsekas and
Tsitsiklis (1989), Iusem (1995), Censor and Zenios (1997). In practice this strategy lets us
avoid handling explicitly the difficult constraint $V(\gamma) > 0$.

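Using the matrix derivatives given in the last section, we find the objective function's
gradient, $\partial f / \partial \theta$,

$$ \frac{\partial f}{\partial \gamma_h} = \frac{\dot{a} + n - k}{2} \mathrm{tr}(P\{h\}) - \frac{1}{2} \mathrm{tr}(Q\{h\} \ddot{S}) - \frac{\ddot{n}}{2} (\beta - \ddot{\beta})' Q\{h\} (\beta - \ddot{\beta}) + 2\, c\, n \sum_{i,j} \left( (V - C) \odot G\{h\} \right)_i^j $$
$$ \frac{\partial f}{\partial \beta} = \ddot{n}\, R\, (\beta - \ddot{\beta}) $$

For the surprise kernel and its gradient, relative to the uninformative prior, we only have
to replace the factor $(\dot{a} + n - k)/2$ by $(\dot{a} + n + 1)/2$.

The Jacobian matrix of the constraints, $\partial h / \partial \theta$, with columns ordered as
$\gamma_1 \ldots \gamma_{10}$, $\beta_1 \ldots \beta_4$, $\delta$, is:

$$ \frac{\partial h}{\partial \theta} =
\begin{bmatrix}
\delta^2 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2\delta\gamma_1 \\
0 & \delta^2 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2\delta\gamma_2 \\
0 & 0 & 0 & 0 & \delta^2 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2\delta\gamma_5 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \delta & 0 & -1 & 0 & \beta_1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \delta & 0 & -1 & \beta_2
\end{bmatrix} $$

At the optimization step, Variable-Metric Proximal-Point algorithms, working with
the explicit analytical derivatives given above, proved to be very stable, in contrast with
the often unpredictable behavior of some methods found in most statistical software, like
Newton-Raphson or "Scoring". Optimization problems of small dimension, like the one above,
allow us to use dense matrix representation without significant loss, see Stern (1994).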

In order to handle several other structural hypotheses, we only have to replace the
constraint, and its Jacobian, passed to the optimizer. Hence, many different hypotheses
about the mean and covariance or correlation structure can be treated in a coherent,
efficient, exact, robust, simple, and unified way.

The derivation of the Monte Carlo procedure for the numerical integrations required 
to implement the FBST in this model is presented in appendix G. 



C.5 Factor Analysis 

This section reviews the most basic facts about FA models. For a synthetic introduction
to factor analysis, see Ghahramani and Hinton (1997) and Everitt (1984). For some
of the matrix analytic and algorithmic details, see Abadir and Magnus (2005), Golub
and Van Loan (1989), Harville (2000), Rubin and Thayer (1982), and Russel (1998). For the
technical issue of factor rotation, see Browne (1974, 2001), Jennrich (2001, 2002, 2004)
and Bernaards and Jennrich (2005).

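The generative model for Factor Analysis (FA) is $x = \Lambda z + u$, where x is a $p \times 1$ vector
of observed random variables, z is a $k \times 1$ vector of latent (unobserved) random variables,
known as factors, and $\Lambda$ is the $p \times k$ matrix of factor loadings, or weights. FA is used as
a dimensionality reduction technique, so k < p.

The vector variates z and u are assumed to be distributed as $N(0, I)$ and $N(0, \Psi)$,
where $\Psi$ is diagonal. Hence, the joint distribution of the observed and latent variables is

$$ \begin{bmatrix} x \\ z \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} \Lambda\Lambda' + \Psi & \Lambda \\ \Lambda' & I \end{bmatrix} \right) $$

For two jointly distributed Gaussian (vector) variates,

$$ \begin{bmatrix} x \\ z \end{bmatrix} \sim N\left( \begin{bmatrix} a \\ b \end{bmatrix} , \begin{bmatrix} A & C \\ C' & D \end{bmatrix} \right) $$

the distribution of z given x is given by, see Zellner (1971),

$$ z \mid x \sim N\left( b + C' A^{-1} (x - a) ,\ D - C' A^{-1} C \right) . $$

Hence, in the FA model,

$$ z \mid x \sim N(Bx ,\ I - B\Lambda) , \quad\text{where} $$
$$ B = \Lambda' (\Lambda\Lambda' + \Psi)^{-1} = \Lambda' \left( \Psi^{-1} - \Psi^{-1} \Lambda (I + \Lambda' \Psi^{-1} \Lambda)^{-1} \Lambda' \Psi^{-1} \right) $$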

C.5.1 EM Algorithm 

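In order to obtain the Maximum Likelihood (ML) estimator of the parameters, one can
use the EM-Algorithm, see Rubin and Thayer (1982) and Russel (1998). The E-step for
the FA model computes the expected first and second moments of the latent variables,
for each observation, x.

$$ E(z \mid x) = Bx , \quad\text{and} $$
$$ E(zz' \mid x) = \mathrm{Cov}(z \mid x) + E(z \mid x) E(z \mid x)' = I - B\Lambda + B x x' B' $$

The M-step optimizes the parameters $\Lambda$ and $\Psi$ of the expected log likelihood for the
FA (completed data) model,

$$ q(\Lambda, \Psi) = E\left( \log \prod_{j=1}^{n} f(x^j, z \mid \Lambda, \Psi) \right) $$
$$ \quad = E\left( \log \prod_{j=1}^{n} (2\pi)^{-p/2} |\Psi|^{-1/2} \exp\left( -\frac{1}{2} (x^j - \Lambda z)' \Psi^{-1} (x^j - \Lambda z) \right) \right) $$
$$ \quad = c - \frac{n}{2} \log|\Psi| - \sum_{j=1}^{n} E\left( \frac{1}{2} x^{j\prime} \Psi^{-1} x^j - x^{j\prime} \Psi^{-1} \Lambda z + \frac{1}{2} z' \Lambda' \Psi^{-1} \Lambda z \right) $$

Using the results computed in the E-step, the last summation can be written as

$$ \sum_{j=1}^{n} \left( \frac{1}{2} x^{j\prime} \Psi^{-1} x^j - x^{j\prime} \Psi^{-1} \Lambda E(z \mid x^j) + \frac{1}{2} \mathrm{tr}\left( \Lambda' \Psi^{-1} \Lambda E(zz' \mid x^j) \right) \right) $$

The ML estimator, $(\Lambda^*, \Psi^*)$, is a stationary point in $\Lambda^*$, therefore

$$ \frac{\partial q}{\partial \Lambda} = \sum_{j=1}^{n} \Psi^{-1} x^j E(z \mid x^j)' - \sum_{j=1}^{n} \Psi^{-1} \Lambda E(zz' \mid x^j) = 0 , \quad\text{hence} $$
$$ \Lambda^* = \left( \sum_{j=1}^{n} x^j E(z \mid x^j)' \right) \left( \sum_{j=1}^{n} E(zz' \mid x^j) \right)^{-1} $$

and also a stationary point in $\Psi^*$, or in its inverse, therefore, substituting the stationary
value of $\Lambda^*$ computed in the last equation,

$$ \frac{\partial q}{\partial \Psi^{-1}} = \frac{n}{2} \Psi^* - \sum_{j=1}^{n} \left( \frac{1}{2} x^j x^{j\prime} - \Lambda^* E(z \mid x^j) x^{j\prime} + \frac{1}{2} \Lambda^* E(zz' \mid x^j) \Lambda^{*\prime} \right) = 0 $$

Solving for $\Psi^*$ and using the diagonality constraint,

$$ \Psi^* = \frac{1}{n} \mathrm{diag}^2 \left( \sum_{j=1}^{n} x^j x^{j\prime} - \Lambda^* \sum_{j=1}^{n} E(z \mid x^j) x^{j\prime} \right) $$

The equation for $\Lambda^*$, in the M-step of the EM algorithm for FA, formally resembles
the equation giving the LS estimation in a Linear Regression model, $\hat{\beta}' = y' X (X'X)^{-1}$.
This is why, in the FA literature, the matrix $\Lambda^*$ is sometimes interpreted as "the linear
regression coefficients of the z's on the x's".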
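The E and M steps above can be sketched compactly for the simplest case of a single factor (k = 1), where $\Lambda$ is a vector, $\Psi$ is stored as a list of variances, and the Woodbury form of B needs no matrix inversion. This is an illustrative sketch under those assumptions, not the authors' code; all function names, and the simulation helper, are hypothetical. Since EM never decreases the likelihood, the log-likelihood trace provides a simple correctness check.

```python
import math
import random


def fa1_log_likelihood(X, lam, psi):
    """Log-likelihood of x ~ N(0, lam lam' + diag(psi)), single factor."""
    p = len(lam)
    s = sum(lam[i] ** 2 / psi[i] for i in range(p))            # lam' Psi^-1 lam
    log_det = sum(math.log(v) for v in psi) + math.log(1 + s)
    total = 0.0
    for x in X:
        q = sum(x[i] ** 2 / psi[i] for i in range(p))
        c = sum(lam[i] * x[i] / psi[i] for i in range(p))
        total -= 0.5 * (p * math.log(2 * math.pi) + log_det + q - c * c / (1 + s))
    return total


def fa1_em_step(X, lam, psi):
    """One EM iteration; returns the updated (lam, psi)."""
    n, p = len(X), len(lam)
    s = sum(lam[i] ** 2 / psi[i] for i in range(p))
    B = [lam[i] / psi[i] / (1 + s) for i in range(p)]          # B = lam'(lam lam' + Psi)^-1
    cov_z = 1 - sum(B[i] * lam[i] for i in range(p))           # I - B Lambda
    ez = [sum(B[i] * x[i] for i in range(p)) for x in X]       # E(z | x^j)
    szz = sum(cov_z + e * e for e in ez)                       # sum of E(zz' | x^j)
    sxz = [sum(x[i] * e for x, e in zip(X, ez)) for i in range(p)]
    lam_new = [sxz[i] / szz for i in range(p)]
    psi_new = [(sum(x[i] ** 2 for x in X) - lam_new[i] * sxz[i]) / n for i in range(p)]
    return lam_new, psi_new


def simulate_one_factor(n, lam, noise_sd, seed):
    """Hypothetical helper: draw n observations x = lam z + u."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        data.append([l * z + rng.gauss(0.0, noise_sd) for l in lam])
    return data
```

A run starts from any positive `psi` and nonzero `lam` and calls `fa1_em_step` repeatedly, monitoring `fa1_log_likelihood`.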

C.5.2 Orthogonal and Oblique Rotations 

Given a FA model and a non-singular coordinate transform, T, it is possible to obtain 
transformed factors together with transformed loadings giving an equivalent FA model. 
Both, a direct, and an inverse, form of the factor loadings transform are common in the 
literature. 

In the direct form,

$$ \tilde{z} = T^{-1} z \quad\text{and}\quad \tilde{\Lambda} = \Lambda T , $$

hence, in the new model,

$$ x = \tilde{\Lambda} \tilde{z} + u = \Lambda T T^{-1} z + u = \Lambda z + u \quad\text{and} $$
$$ \mathrm{Cov}(x) = \tilde{\Lambda}\, \mathrm{Cov}(\tilde{z})\, \tilde{\Lambda}' + \Psi = \Lambda T (T^{-1} I\, T^{-t}) T' \Lambda' + \Psi = \Lambda\Lambda' + \Psi . $$

In the inverse form,

$$ \tilde{z} = T' z \quad\text{and}\quad \tilde{\Lambda} = \Lambda T^{-t} , $$

hence, in the new model,

$$ x = \tilde{\Lambda} \tilde{z} + u = \Lambda T^{-t} T' z + u = \Lambda z + u \quad\text{and} $$
$$ \mathrm{Cov}(x) = \tilde{\Lambda}\, \mathrm{Cov}(\tilde{z})\, \tilde{\Lambda}' + \Psi = \Lambda T^{-t} (T' I\, T) T^{-1} \Lambda' + \Psi = \Lambda\Lambda' + \Psi . $$

This shows that the FA model is only determined by the k dimensional subspace of
$R^p$ spanned by the factors. Any change of coordinates in this (sub)space, given by T,
leads to an equivalent model.

An operator T is an orthogonal rotation iff $T'T = I$. Hence, orthogonal transformed
factors are still normalized and uncorrelated.

An operator T is an oblique rotation (in the inverse form) iff $\mathrm{diag}^2(T'T) = I$. Hence,
oblique transformed factors are still normalized, but correlated.

We want to choose either an orthogonal or an oblique rotation, T, so as to minimize a
complexity criterion function of $\tilde{\Lambda}$. Before discussing appropriate criteria and how to use
them, we examine some technical details concerning matrix norms and projections in the
following subsection.



C.5.3 Frobenius Norm and Projections 

The matrix Frobenius product and the matrix Frobenius norm are defined as follows:

$$ \langle A \mid B \rangle_F = \mathrm{tr}(A'B) = \mathbf{1}' (A \odot B) \mathbf{1} , \quad \|A\|_F^2 = \langle A \mid A \rangle_F . $$

Lemma 1: The projection, T, with respect to the Frobenius norm, into the algebraic
sub-manifold of the oblique rotation matrices of a square matrix, A, is given as follows,

$$ T = A\, \mathrm{diag}^2(A'A)^{-1/2} . $$

A matrix T represents an oblique rotation iff it has normalized columns, that is, iff
$\mathrm{diag}(T'T) = \mathbf{1}$. We want to minimize

$$ \|A - T\|_F^2 = \sum_{j} \|A^j - T^j\|_2^2 . $$

But, in the 2-norm, the normalized vector that is closest to $A^j$ is the one that has the
same direction as the vector $A^j$, that is,

$$ T^j = \frac{1}{\|A^j\|_2} A^j = \frac{1}{(A^{j\prime} A^j)^{1/2}} A^j , $$

hence, the lemma.
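Lemma 1 says that projecting onto the oblique rotations is nothing more than a column normalization. A minimal sketch, with a hypothetical function name and matrices stored as lists of rows, assuming no column of A is zero:

```python
import math


def project_oblique(A):
    """Return T = A diag2(A'A)^(-1/2): each column of A scaled to unit 2-norm."""
    n_rows, n_cols = len(A), len(A[0])
    norms = [math.sqrt(sum(A[i][j] ** 2 for i in range(n_rows))) for j in range(n_cols)]
    return [[A[i][j] / norms[j] for j in range(n_cols)] for i in range(n_rows)]
```

For A with columns (3, 4) and (0, 2), the projection has columns (0.6, 0.8) and (0, 1).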

Lemma 2: The projection, Q, with respect to the Frobenius norm, into the algebraic
sub-manifold of the orthogonal rotation matrices of a square matrix, A, is given by its
SVD factorization, as follows,

$$ Q = UV' \quad\text{where}\quad U'(A)V = \mathrm{diag}(s) . $$

In order to prove the second lemma, we will consider the following problem. The
orthogonal Procrustes problem seeks the orthogonal rotation, $Q \mid Q'Q = I$, that minimizes
the Frobenius norm of the difference between a given matrix A, $m \times p$, and the rotation
of a second matrix, B. Formally, the problem is stated as

$$ \min_{Q \mid Q'Q = I} \|A - BQ\|_F^2 $$

The norm function being minimized can be restated as

$$ \|A - BQ\|_F^2 = \mathrm{tr}(A'A) + \mathrm{tr}(B'B) - 2\, \mathrm{tr}(Q'B'A) $$

Hence the problem asks for the maximum of the last term. Let Z be an orthogonal matrix
defined by Q and the SVD factorization of B'A as follows,

$$ U'(B'A)V = S = \mathrm{diag}(s) , \quad Z = V'Q'U . $$

We have,

$$ \mathrm{tr}(Q'B'A) = \mathrm{tr}(Q'USV') = \mathrm{tr}(ZS) = s'\, \mathrm{diag}(Z) \le s' \mathbf{1} . $$

But the last inequality is tight if Z = I, hence the optimal solution for the orthogonal
Procrustes problem is

$$ Q = UV' \quad\text{where}\quad U'(B'A)V = \mathrm{diag}(s) . $$

In order to prove lemma 2, just consider the case B = I.



C.5.4 Sparsity Optimization 

In the FA literature, minimizing the complexity of the factor loadings, $\Lambda$, is accomplished
by maximizing a measure of its sparsity, $f(\Lambda)$.

A natural sparsity measure in engineering applications is the Minimum Entropy
measure. This measure and its (matrix) derivative are given by

$$ f_{me}(\Lambda) = -\frac{1}{2} \langle \Lambda^2 \mid \log(\Lambda^2) \rangle_F , \quad (\Lambda^2)_i^j = (\Lambda_i^j)^2 , $$
$$ \frac{d f_{me}(\Lambda)}{d\Lambda} = -\Lambda \odot \log(\Lambda^2) - \Lambda . $$

Several variations of the entropy sparsity measure are used in the literature, see Bernaards
and Jennrich (2005).

Hoyer (2004) proposes the following sparsity measure for a vector $x \in R^n$, based on
the difference of two p-norms, namely p = 1 and p = 2,

$$ f_{ho}(x) = \frac{1}{\sqrt{n} - 1} \left( \sqrt{n} - \frac{\|x\|_1}{\|x\|_2} \right) $$

From the Cauchy-Schwarz inequality, we have the bounds,

$$ \frac{1}{\sqrt{n}} \|x\|_1 \le \|x\|_2 \le \|x\|_1 , \quad\text{hence}\quad 0 \le f_{ho}(x) \le 1 . $$
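The two extremes of Hoyer's measure are easy to verify in code. The sketch below uses a hypothetical function name and assumes a nonzero vector with at least two components.

```python
import math


def hoyer_sparsity(x):
    """Hoyer's measure: 0 for a constant-magnitude vector, 1 for a one-hot vector."""
    n = len(x)
    norm1 = sum(abs(v) for v in x)
    norm2 = math.sqrt(sum(v * v for v in x))
    return (math.sqrt(n) - norm1 / norm2) / (math.sqrt(n) - 1)
```

A vector with a single nonzero entry attains the upper bound, and a vector whose entries all have the same magnitude attains the lower bound.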



Similar interpretations can be given to Carroll's Oblimin, on the parameter $\gamma$, and
Crawford-Ferguson, on the parameter $\kappa$, families of sparsity measures. These measures,
for $\Lambda$ $p \times k$, and their (matrix) derivatives are given by,

$$ f_{\gamma}(\Lambda) = \frac{1}{4} \langle \Lambda^2 \mid B(\gamma) \rangle_F , \quad\text{where}\quad B(\gamma) = (I - \gamma C) \Lambda^2 N , $$
$$ f_{\kappa}(\Lambda) = \frac{1}{4} \langle \Lambda^2 \mid B(\kappa) \rangle_F , \quad\text{where}\quad B(\kappa) = (1 - \kappa) \Lambda^2 N + \kappa M \Lambda^2 , $$
$$ (\Lambda^2)_i^j = (\Lambda_i^j)^2 , \quad M_i^j = 1 - \delta_i^j ,\ p \times p , \quad N_i^j = 1 - \delta_i^j ,\ k \times k . $$

These parametric families include many sparsity measures, or simplicity criteria,
traditionally used in psychometric studies. For example, setting $\gamma$ to 0, 1/2, or 1, we have
the Quartimin, Biquartimin or Covarimin criterion; also, setting $\kappa$ to 0, 1/p, k/(2p) or
(k − 1)/(p + k − 2), we have the Quartimax, Varimax, Equamax or Parsimax criterion.

In order to search for an optimal transformation, $T^*$, we need to express the sparsity
function and its matrix derivative as functions of T. In the direct form,

$$ \frac{d f(\tilde{\Lambda})}{dT} = \frac{d f(\Lambda T)}{dT} = \Lambda' \frac{d f(\tilde{\Lambda})}{d\tilde{\Lambda}} . $$

In the inverse form,

$$ \frac{d f(\tilde{\Lambda})}{dT} = \frac{d f(\Lambda T^{-t})}{dT} = -\left( \tilde{\Lambda}' \frac{d f(\tilde{\Lambda})}{d\tilde{\Lambda}} T^{-1} \right)' . $$

These expressions, together with the projectors obtained in the last section, can be used
in standard gradient projection optimization algorithms, like the Generalized Reduced
Gradient (GRG) or other standard primal optimization algorithms, see Bernaards and
Jennrich (2005), Jennrich (2002), Luenberger (1984), Minoux and Vajda (1986), Shah et
al. (1964), and Stern et al. (2006).

The projection operation for oblique rotations only requires inexpensive matrix
operations, like a matrix inversion, performed numerically as an LU or QR factorization. The
projection operation for orthogonal rotations, on the other hand, requires an SVD
factorization, an operation that requires much more computational work. Therefore, a
constraint-free representation of an orthogonal matrix can be very useful in designing optimization
algorithms, see Browne (1974, 2001). The Cayley transform establishes a one-to-one
correspondence between skew-symmetric operators, K, and the orthogonal operators, Q, that
do not have −1 as a characteristic value, see Gantmacher (1959, I, 288-289). Although
extreme reversal operators, like a coordinate reflection or permutation, cannot be
represented in this form, there is a Cayley representation for any local, that is, not too far
from the identity, orthogonal operator.

$$ J = K + I , \quad K_i^j = -K_j^i . $$
$$ K = (I - Q)(I + Q)^{-1} = 2(I + Q)^{-1} - I , $$
$$ Q = (I - K)(I + K)^{-1} = 2J^{-1} - I . $$
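In two dimensions the Cayley transform can be followed by hand, since a skew-symmetric K has a single free entry. The sketch below is only an illustration with hypothetical function names; it maps K to $Q = 2J^{-1} - I$ and back through $K = 2(I + Q)^{-1} - I$, using an explicit 2 x 2 inverse.

```python
def inverse_2x2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]


def two_inverse_minus_identity(M):
    """Return 2 M^-1 - I for a 2 x 2 matrix M."""
    inv = inverse_2x2(M)
    return [[2 * inv[0][0] - 1, 2 * inv[0][1]],
            [2 * inv[1][0], 2 * inv[1][1] - 1]]


def cayley(K):
    """Orthogonal Q = 2 (I + K)^-1 - I from a skew-symmetric K."""
    J = [[1 + K[0][0], K[0][1]], [K[1][0], 1 + K[1][1]]]
    return two_inverse_minus_identity(J)


def inverse_cayley(Q):
    """Skew-symmetric K = 2 (I + Q)^-1 - I, valid when -1 is not an eigenvalue of Q."""
    M = [[1 + Q[0][0], Q[0][1]], [Q[1][0], 1 + Q[1][1]]]
    return two_inverse_minus_identity(M)
```

For K with off-diagonal entries (0.5, −0.5), `cayley` returns a plane rotation with cosine 0.6, and `inverse_cayley` recovers K.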

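The sparsity measure derivatives of the direct orthogonal rotation of the factor loadings,
using the Cayley representation, are given by,

$$ f(\tilde{\Lambda}) = f(\Lambda T) , \quad T = Q = 2J^{-1} - I , $$
$$ \frac{\partial f(\tilde{\Lambda})}{\partial J_i^j} = \mathrm{tr}\left( \frac{\partial f(\tilde{\Lambda})}{\partial T}' \frac{\partial T}{\partial J_i^j} \right) = -2\, \mathrm{tr}\left( \frac{\partial f(\tilde{\Lambda})}{\partial T}' J^{-1} \frac{\partial J}{\partial J_i^j} J^{-1} \right) = 2 (Y_i^j - Y_j^i) , $$
$$ \text{where}\quad Y = J^{-1} \frac{\partial f(\tilde{\Lambda})}{\partial T}' J^{-1} . $$

C.6 Mixture Models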

C.6 Mixture Models 

The matrix notation used in this section is defined in section F.1. In this section, h, i are
indices in the range 1:d, k is in 1:m, and j is in 1:n.

In a d-dimensional multivariate finite mixture model with m components (or classes),
and sample size n, any given sample $x^j$ is of class k with probability $w_k$; the weights, $w_k$,
give the probability that a new observation is of class k. A sample j of class k = c(j) is
distributed with density $f(x^j \mid \psi_k)$.

The classifications $z_k^j$ are boolean variables indicating whether or not $x^j$ is of class
k, i.e. $z_k^j = 1$ iff c(j) = k. Z is not observed, being therefore named latent variable or
missing data. Conditioning on the missing data, we get:

$$ f(x^j \mid \theta) = \sum_{k=1}^{m} f(x^j \mid \theta, z_k^j)\, f(z_k^j \mid \theta) = \sum_{k=1}^{m} w_k f(x^j \mid \psi_k) $$
$$ f(X \mid \theta) = \prod_{j=1}^{n} f(x^j \mid \theta) = \prod_{j=1}^{n} \sum_{k=1}^{m} w_k f(x^j \mid \psi_k) $$

Given the mixture parameters, $\theta$, and the observed data, X, the conditional
classification probabilities, $P = f(Z \mid X, \theta)$, are:

$$ p_k^j = f(z_k^j \mid x^j, \theta) = \frac{w_k f(x^j \mid \psi_k)}{\sum_{k=1}^{m} w_k f(x^j \mid \psi_k)} $$

We use $y_k$ for the number of samples of class k, i.e. $y_k = \sum_j z_k^j$, or $y = Z\mathbf{1}$. The
likelihood for the "completed" data, X, Z, is:

$$ f(X, Z \mid \theta) = \prod_{j=1}^{n} w_{c(j)} f(x^j \mid \psi_{c(j)}) = \prod_{k=1}^{m} w_k^{y_k} \prod_{j \mid c(j) = k} f(x^j \mid \psi_k) $$

We will see in the following sections that considering the missing data Z, and the
conditional classification probabilities P, is the key for successfully solving the numerical
integration and optimization steps of the FBST. In this article we will focus on Gaussian
finite mixture models, where $f(x^j \mid \psi_k) = N(x^j \mid b^k, R\{k\})$, a normal density with mean
$b^k$ and variance matrix $V\{k\}$, or precision $R\{k\} = (V\{k\})^{-1}$. Next we specialize the theory
of general mixture models to the Dirichlet-Normal-Wishart case.
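The conditional classification probabilities P are the quantity that both the Gibbs sampler and the EM algorithm recompute at every cycle. As an illustration, the sketch below evaluates them for a one-dimensional Gaussian mixture (d = 1); the function names are hypothetical, and components are parameterized by mean and variance rather than precision.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def classification_probabilities(xs, weights, means, variances):
    """Return P with P[j][k] = probability that sample j belongs to class k."""
    P = []
    for x in xs:
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)                      # mixture density f(x | theta)
        P.append([q / total for q in joint])
    return P
```

Each row of P sums to one, and a sample lying midway between two identical components with equal weights gets probability 1/2 for each class.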



C.6.1 Dirichlet-Normal-Wishart Mixtures 

Consider the random matrix $X_i^j$, i in 1:d, j in 1:n, n > d, where each column contains a
sample element from a d-multivariate normal distribution with parameters b (mean) and
V (covariance), or $R = V^{-1}$ (precision). Let u and S denote the statistics:

$$ u = (1/n) \sum_{j=1}^{n} x^j = (1/n) X \mathbf{1} , \quad S = \sum_{j=1}^{n} (x^j - b) \otimes (x^j - b)' = (X - b)(X - b)' $$

The random vector u has normal distribution with mean b and precision nR. The
random matrix S has Wishart distribution with n degrees of freedom and precision matrix
R. The Normal, Wishart and Normal-Wishart pdfs have expressions:

$$ N(u \mid n, b, R) = \left( \frac{n}{2\pi} \right)^{d/2} |R|^{1/2} \exp\left( -(n/2) (u - b)' R (u - b) \right) $$
$$ W(S \mid e, R) = c^{-1} |S|^{(e-d-1)/2} \exp\left( -(1/2)\, \mathrm{tr}(S R) \right) $$

with normalization constant $c = |R|^{-e/2}\, 2^{ed/2}\, \pi^{d(d-1)/4} \prod_{i=1}^{d} \Gamma((e - i + 1)/2)$.

Now consider the matrix X as above, with unknown mean b and unknown precision
matrix R, and the statistic

$$ S = \sum_{j=1}^{n} (x^j - u) \otimes (x^j - u)' = (X - u)(X - u)' $$
The conjugate family of priors for multivariate normal distributions is the
Normal-Wishart, see DeGroot (1970). Take as prior distribution for the precision matrix R the
Wishart distribution with $\dot{e} > d - 1$ degrees of freedom and precision matrix $\dot{S}$ and, given
R, take as prior for b a multivariate normal with mean $\dot{u}$ and precision $\dot{n} R$, i.e. let us take
the Normal-Wishart prior $NW(b, R \mid \dot{n}, \dot{e}, \dot{u}, \dot{S})$. Then, the posterior distribution for R is
a Wishart distribution with $\ddot{e}$ degrees of freedom and precision $\ddot{S}$, and the posterior for
b, given R, is d-Normal with mean $\ddot{u}$ and precision $\ddot{n} R$, i.e., we have the Normal-Wishart
posterior:

$$ NW(b, R \mid \ddot{n}, \ddot{e}, \ddot{u}, \ddot{S}) = W(R \mid \ddot{e}, \ddot{S})\, N(b \mid \ddot{n}, \ddot{u}, R) $$
$$ \ddot{n} = \dot{n} + n , \quad \ddot{e} = \dot{e} + n , \quad \ddot{u} = (n u + \dot{n} \dot{u}) / \ddot{n} $$
$$ \ddot{S} = S + \dot{S} + (n \dot{n} / \ddot{n}) (u - \dot{u}) \otimes (u - \dot{u})' $$

All covariance and precision matrices are supposed to be positive definite, and proper
priors have $\dot{e} > d$, and $\dot{n} > 1$. Non-informative Normal-Wishart improper priors are given
by $\dot{n} = 0$, $\dot{u} = 0$, $\dot{e} = 0$, $\dot{S} = 0$, i.e. we take a Wishart with 0 degrees of freedom as
prior for R, and a constant prior for b, see DeGroot (1970). Then, the posterior for R is
a Wishart with n degrees of freedom and precision S, and the posterior for b, given R, is
d-Normal with mean u and precision nR.

The conjugate prior for a multinomial distribution is a Dirichlet distribution:

$$ M(y \mid n, w) = (n! / y_1! \ldots y_m!)\; w_1^{y_1} \ldots w_m^{y_m} $$
$$ D(w \mid y) = \left( \Gamma(y_1 + \ldots + y_m) / \Gamma(y_1) \ldots \Gamma(y_m) \right) \prod_{k=1}^{m} w_k^{y_k - 1} $$

with $w > 0$ and $w' \mathbf{1} = 1$. Prior information given by $\dot{y}$, and observation y, result in the
posterior parameter $\ddot{y} = \dot{y} + y$. A non-informative prior is given by $\dot{y} = \mathbf{1}$.

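Finally, we can write the posterior and completed posterior for the model as:

$$ f(\theta \mid X, \dot{\theta}) \propto f(X \mid \theta)\, f(\theta \mid \dot{\theta}) $$
$$ f(X \mid \theta) = \prod_{j=1}^{n} \sum_{k=1}^{m} w_k N(x^j \mid b^k, R\{k\}) $$
$$ f(\theta \mid \dot{\theta}) = D(w \mid \dot{y}) \prod_{k=1}^{m} NW(b^k, R\{k\} \mid \dot{n}_k, \dot{e}_k, \dot{u}^k, \dot{S}\{k\}) $$
$$ p_k^j = w_k N(x^j \mid b^k, R\{k\}) \,\Big/\, \sum_{k=1}^{m} w_k N(x^j \mid b^k, R\{k\}) $$

$$ f(\theta \mid X, Z, \dot{\theta}) \propto f(\theta \mid X, Z)\, f(\theta \mid \dot{\theta}) = D(w \mid \ddot{y}) \prod_{k=1}^{m} NW(b^k, R\{k\} \mid \ddot{n}_k, \ddot{e}_k, \ddot{u}^k, \ddot{S}\{k\}) $$
$$ y = Z \mathbf{1} , \quad \ddot{y} = \dot{y} + y , \quad \ddot{n} = \dot{n} + y , \quad \ddot{e} = \dot{e} + y $$
$$ u^k = (1/y_k) \sum_{j=1}^{n} z_k^j x^j , \quad S\{k\} = \sum_{j=1}^{n} z_k^j (x^j - u^k) \otimes (x^j - u^k)' $$
$$ \ddot{u}^k = (1/\ddot{n}_k) (\dot{n}_k \dot{u}^k + y_k u^k) , \quad \ddot{S}\{k\} = S\{k\} + \dot{S}\{k\} + (\dot{n}_k y_k / \ddot{n}_k) (u^k - \dot{u}^k) \otimes (u^k - \dot{u}^k)' $$

C.6.2 Gibbs Sampling and Integration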

In order to integrate a function over the posterior measure, we use an ergodic Markov
Chain. The form of the Chain below is known as Gibbs sampling, and its use for numerical
integration is known as Markov Chain Monte Carlo, or MCMC.

Given $\theta$, we can compute P. Given P, $f(z^j \mid p^j)$ is a simple multinomial distribution.
Given the latent variables, Z, we have simple conditional posterior density expressions
for the mixture parameters:

$$ f(w \mid Z, \dot{y}) = D(w \mid \ddot{y}) , \quad f(R\{k\} \mid X, Z, \ddot{e}_k, \ddot{S}\{k\}) = W(R \mid \ddot{e}_k, \ddot{S}\{k\}) $$
$$ f(b^k \mid X, Z, R\{k\}, \ddot{n}_k, \ddot{u}^k) = N(b \mid \ddot{n}_k, \ddot{u}^k, R\{k\}) $$

Gibbs sampling is nothing but the MCMC generated by cyclically updating variables
Z, $\theta$, and P, by drawing $\theta$ and Z from the above distributions, see Gilks et al. (1996)
and Häggström (2002). A uniform generator is all that is needed to draw the multinomial
variate. A Dirichlet variate w can be drawn using a gamma generator with shape and
scale parameters $\alpha$ and $\beta$, $G(\alpha, \beta)$, see Gentle (1998): a) $g_k = G(\ddot{y}_k, 1)$ ;
b) $w_k = g_k / \sum_{k=1}^{m} g_k$. Johnson (1987) describes a simple
procedure to generate the Cholesky factor of a Wishart variate $W = U'U$ with n degrees
of freedom, from the Cholesky factorization of the covariance $V = R^{-1} = C'C$, and a
chi-square generator: c) for i < j , $B_i^j = N(0, 1)$ ;
d) $B_i^i = \sqrt{\chi^2(n - i + 1)}$ ; and e) $U = BC$. All subsequent matrix computations proceed
directly from the Cholesky factors, see Jones (1985).
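Steps a) to e) can be sketched with the generators of a standard library. The following is an illustrative sketch, not the original implementation; the function names are hypothetical, matrices are lists of rows with 0-based indices, and the chi-square variate with `df` degrees of freedom is drawn as a gamma variate with shape `df / 2` and scale 2.

```python
import math
import random


def draw_dirichlet(y, rng):
    """Steps a) and b): normalize independent gamma variates G(y_k, 1)."""
    g = [rng.gammavariate(yk, 1.0) for yk in y]
    total = sum(g)
    return [v / total for v in g]


def draw_bartlett_factor(n, d, rng):
    """Steps c) and d): upper triangular B, assuming n >= d."""
    B = [[0.0] * d for _ in range(d)]
    for i in range(d):
        df = n - i                      # n - i + 1 in the 1-based notation of the text
        B[i][i] = math.sqrt(rng.gammavariate(df / 2.0, 2.0))
        for j in range(i + 1, d):
            B[i][j] = rng.gauss(0.0, 1.0)
    return B


def matmul(A, B):
    return [[sum(A[i][r] * B[r][j] for r in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def draw_wishart_factor(n, C, rng):
    """Step e): U = B C, so that W = U'U, for a given upper Cholesky factor C."""
    return matmul(draw_bartlett_factor(n, len(C), rng), C)
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps a run reproducible.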

Label Switching and Forbidden States 

Given a mixture model, we obtain an equivalent model by renumbering the components 
$1{:}m$ by a permutation $\sigma([1{:}m])$. This symmetry must be broken in order to have an 
identifiable model, see Stephens (1997). Let us assume there is an order criterion that 
can be used when numbering the components. If the components are not in the correct 
order, Label Switching is the operation of finding a permutation $\sigma([1{:}m])$ and renumbering 
the components, so that the order criterion is satisfied. If we want to look consistently 
at the classifications produced during an MCMC run, we must enforce label switching to 
break all non-identifiability symmetries. For example, in the Dirichlet-Normal-Mixture 
model, we could choose to order the components (switch labels) according to the rank 
given by: 1) A given linear combination of the vector means, $c' b^k$; 2) The variance 
determinant. The choice of a good label switching criterion should consider not 
only the model structure and the data, but also the semantics and interpretation of the 
model. 
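
As an illustration of the first ordering criterion, the sketch below relabels the components of one MCMC state by the rank of a linear combination of the component means. The function name `relabel`, the 0-based labels, and the array layout (one row per component) are assumptions made for this example only.

```python
import numpy as np

def relabel(w, b, z, c):
    """Renumber mixture components so that c'b^k is increasing in k.

    w : (m,)   component weights
    b : (m, d) component means, one row per component
    z : (n,)   current class label (0..m-1) of each observation
    c : (d,)   coefficients of the linear ordering criterion
    """
    order = np.argsort(b @ c)                 # order[new_label] = old_label
    inverse = np.empty_like(order)
    inverse[order] = np.arange(len(order))    # inverse[old_label] = new_label
    return w[order], b[order], inverse[z]
```

Applying the same permutation to the weights, the means and the latent labels keeps the state equivalent to the original one, while making successive MCMC states comparable.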

The semantics and interpretation of the model may also dictate that some states, like 
certain configurations of the latent variables $Z$, are either meaningless or invalid, and 
shall not be considered as possible solutions. The MCMC can be adapted to deal with 
forbidden states by implementing rejection rules that prevent the chain from entering the 
forbidden regions of the complete and/or incomplete state space, see Bennett (1976) and 
Meng (1996). 



C.6.3 EM Algorithm for ML and MAP Estimation 

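The EM algorithm optimizes the log-posterior function $fl(X \mid \theta) + fl(\theta \mid \dot{\theta})$, see Dempster 
(1977), Ormoneit (1995) and Russel (1988). The EM is derived from the conditional 
log-likelihood, and the Jensen inequality: If $w, y > 0$, $w'\mathbf{1} = 1$, then $\log w'y \geq w' \log y$. 
Let $\theta$ and $\tilde{\theta}$ be our current and next estimate of the MAP (Maximum a Posteriori), 
and $p_k^j = f(z_k^j \mid x^j, \theta)$ the conditional classification probabilities. At each iteration, the 
log-posterior improvement is: 

$$ \delta(\tilde{\theta}, \theta \mid X, \dot{\theta}) = fl(\tilde{\theta} \mid X, \dot{\theta}) - fl(\theta \mid X, \dot{\theta}) = \delta(\tilde{\theta}, \theta \mid X) + \delta(\tilde{\theta}, \theta \mid \dot{\theta}) $$
$$ \delta(\tilde{\theta}, \theta \mid \dot{\theta}) = fl(\tilde{\theta} \mid \dot{\theta}) - fl(\theta \mid \dot{\theta}) $$
$$ \delta(\tilde{\theta}, \theta \mid X) = fl(X \mid \tilde{\theta}) - fl(X \mid \theta) = \sum_{j} \delta(\tilde{\theta}, \theta \mid x^j) $$
$$ \delta(\tilde{\theta}, \theta \mid x^j) = fl(x^j \mid \tilde{\theta}) - fl(x^j \mid \theta) = \log \sum_{k} \tilde{w}_k f(x^j \mid \tilde{\psi}_k) - fl(x^j \mid \theta) $$
$$ = \log \sum_{k} p_k^j \frac{\tilde{w}_k f(x^j \mid \tilde{\psi}_k)}{w_k f(x^j \mid \psi_k)} \geq \Delta(\tilde{\theta}, \theta \mid x^j) = \sum_{k} p_k^j \log \frac{\tilde{w}_k f(x^j \mid \tilde{\psi}_k)}{w_k f(x^j \mid \psi_k)} $$

Hence, $\Delta(\tilde{\theta}, \theta \mid X, \dot{\theta}) = \Delta(\tilde{\theta}, \theta \mid X) + \delta(\tilde{\theta}, \theta \mid \dot{\theta})$ is a lower bound to $\delta(\tilde{\theta}, \theta \mid X, \dot{\theta})$. Also 
$\Delta(\theta, \theta \mid X, \dot{\theta}) = \delta(\theta, \theta \mid X, \dot{\theta}) = 0$. So, under mild differentiability conditions, both surfaces 
are tangent, assuring convergence of EM to the nearest local maximum. But maximizing 
$\Delta(\tilde{\theta}, \theta \mid X, \dot{\theta})$ over $\tilde{\theta}$ is the same as maximizing 

$$ Q(\tilde{\theta}, \theta) = \sum_{k, j} p_k^j \log \left( \tilde{w}_k f(x^j \mid \tilde{\psi}_k) \right) + fl(\tilde{\theta} \mid \dot{\theta}) $$

and each iteration of the EM algorithm breaks down into two steps: 
E-step: Compute $P = E(Z \mid X, \theta)$. 
M-step: Optimize $Q(\tilde{\theta}, \theta)$, given $P$. 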

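For the Gaussian mixture model, with a Dirichlet-Normal-Wishart prior, 

$$ Q(\tilde{\theta}, \theta) = \sum_{k=1}^{m} \sum_{j=1}^{n} p_k^j \left( \log \tilde{w}_k + \log N(x^j \mid \tilde{b}^k, \tilde{R}\{k\}) \right) + fl(\tilde{\theta} \mid \dot{\theta}) $$
$$ fl(\theta \mid \dot{\theta}) = \log D(w \mid \dot{y}) + \sum_{k=1}^{m} \log NW(b^k, R\{k\} \mid \dot{n}_k, \dot{e}_k, \dot{u}^k, \dot{S}\{k\}) $$

Lagrange optimality conditions give simple analytical solutions for the M-step: 

$$ y_k = \sum_{j=1}^{n} p_k^j , \quad w_k = \frac{y_k + \dot{y}_k - 1}{n - m + \sum_{k=1}^{m} \dot{y}_k} $$
$$ \bar{u}^k = \frac{1}{y_k} \sum_{j=1}^{n} p_k^j x^j , \quad \bar{S}\{k\} = \sum_{j=1}^{n} p_k^j (x^j - b^k) \otimes (x^j - b^k)' $$
$$ b^k = \frac{\dot{n}_k \dot{u}^k + y_k \bar{u}^k}{\dot{n}_k + y_k} , \quad V\{k\} = R\{k\}^{-1} = \frac{\bar{S}\{k\} + \dot{n}_k (b^k - \dot{u}^k) \otimes (b^k - \dot{u}^k)' + \dot{S}\{k\}}{y_k + \dot{e}_k - d} $$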

Global Optimization 

In more general (non-Gaussian) mixture models, if an analytical solution for the M-step 
is not available, a robust local optimization algorithm can be used, for example Martinez 
(2000). The EM is only a local optimizer, but the MCMC provides plenty of good starting 
points, so we have the basic elements for a global optimizer. To avoid using many starting 
points that converge to the same local maximum, we can filter the top portion of the 
MCMC output (ranked by posterior density) using a clustering algorithm, and select a 
starting point from each cluster. For better efficiency, or more complex problems, the 
Stochastic EM or SEM algorithm can be used to provide starting points near each 
important local maximum, see Celeux (1995), Pflug (1996) and Spall (2003). 



C.6.4 Experimental Tests and Final Remarks 

The test case used in this study is given by a sample $X$ assumed to follow a mixture of 
bivariate normal distributions with unknown parameters, including the number of 
components. $X$ is the Iris virginica data set, with sepal and petal length of 50 specimens (1 
discarded outlier). The botanical problem consists of determining whether or not there 
are two distinct subspecies in the population, see Anderson (1935), Fisher (1936) and 
McLachlan (2000). Figure 1 presents the dataset and posterior density level curves for 
the parameters, $\theta^*$ and $\hat{\theta}$, optimized for the 1 and 2 component models. 

Figure 1: Iris virginica data and models with one (left) and two (right) components 



In the FBST formulation of the problem, the 2 component model is the base model, and the 
hypothesis to be tested is the constraint of having only 1 component. When implementing 
the FBST one has to be careful with trapping states on the MCMC. These typically are 
states where one component has a small number of sample points that become (nearly) 
collinear, resulting in a singular posterior. This problem is particularly serious with the 
Iris dataset because of the small precision, only 2 significant digits, of the measurements. 
A standard way to avoid this inconvenience is to use flat or minimally informative priors, 
instead of non-informative priors, see Robert (1996). 

We used as flat prior parameters: $\dot{y} = 1$, $\dot{n} = 1$, $\dot{u} = \bar{u}$, $\dot{e} = 3$, $\dot{S} = (1/n)\bar{S}$. Robert 
(1996) uses, with similar effects, $\dot{e} = 6$, $\dot{S} = (1.5/n)\bar{S}$. 

The FBST selects the 2 component model, rejecting $H$, if the evidence against the 
hypothesis is above a given threshold, $\overline{\mathrm{ev}}(H) > \tau$, and selects the 1 component model, 
accepting $H$, otherwise. The threshold $\tau$ is chosen by empirical power analysis, see Stern 
and Zacks (2002) and Lauretto et al. (2003). Let $\theta^*$ and $\hat{\theta}$ represent the constrained (1 
component) and unconstrained (2 component) maximum a posteriori (MAP) parameters 
optimized to the Iris dataset. Next, generate two collections of $t$ simulated datasets of size 
$n$, the first collection at $\theta^*$, and the second at $\hat{\theta}$. $\alpha(\tau)$ and $\beta(\tau)$, the empirical type 1 and 
type 2 statistical errors, are the rejection rate in the first collection and the acceptance 
rate in the second collection. A small, $t = 500$, calibration run sets the threshold $\tau$ so 
as to minimize the total error, $(\alpha(\tau) + \beta(\tau))/2$. Other methods like sensitivity analysis, see 
Stern (2004a,b), and loss functions, see Madruga (2001), could also be used. 
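
The calibration step can be sketched as a simple search over candidate thresholds. The sketch assumes the evidence against the hypothesis has already been computed for every simulated dataset; the function name `calibrate_threshold` and the choice of candidate thresholds (the observed evidence values themselves) are assumptions of this example, not a description of the authors' code.

```python
def calibrate_threshold(ev_under_h, ev_under_alt):
    """Pick the threshold minimizing the average of type 1 and type 2 error rates.

    ev_under_h   : evidence against H for datasets simulated at the constrained optimum
    ev_under_alt : evidence against H for datasets simulated at the unconstrained optimum
    Returns (tau, total_error, alpha, beta).
    """
    candidates = sorted(set(ev_under_h) | set(ev_under_alt) | {0.0, 1.0})
    best = None
    for tau in candidates:
        # Type 1 error: H is true but rejected (evidence against H above tau).
        alpha = sum(e > tau for e in ev_under_h) / len(ev_under_h)
        # Type 2 error: H is false but accepted (evidence against H not above tau).
        beta = sum(e <= tau for e in ev_under_alt) / len(ev_under_alt)
        total = (alpha + beta) / 2.0
        if best is None or total < best[1]:
            best = (tau, total, alpha, beta)
    return best
```

With ties, this sketch keeps the smallest minimizing threshold; any other tie-breaking rule would serve equally well for illustration.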

Biernacki and Govaert (1998) studied similar mixture problems and compared several 
selection criteria, identifying as the best overall performers: AIC - Akaike Information 
Criterion, AIC3 - Bozdogan's modified AIC, and BIC - Schwarz's Bayesian Information 
Criterion. These are regularization criteria, weighting the model fit against the number 
of parameters, see Pereira and Stern (2001). If $\lambda$ is the model log-likelihood, $\kappa$ its number 
of parameters, and $n$ the sample size, then, 

$$ AIC = -2\lambda + 2\kappa , \quad AIC3 = -2\lambda + 3\kappa \quad \text{and} \quad BIC = -2\lambda + \kappa \log(n) . $$
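
The three criteria are straightforward to compute once the maximized log-likelihood is available. The helper below is a small illustrative sketch; the function name `information_criteria` is hypothetical. For a mixture of $m$ bivariate normal components, the parameter count would be $6m - 1$ ($m - 1$ free weights, $2m$ mean coordinates and $3m$ covariance entries).

```python
import math

def information_criteria(loglik, k, n):
    """AIC, AIC3 and BIC from the log-likelihood, parameter count k and sample size n."""
    return {
        "AIC": -2.0 * loglik + 2.0 * k,
        "AIC3": -2.0 * loglik + 3.0 * k,
        "BIC": -2.0 * loglik + k * math.log(n),
    }
```

The model with the smallest value of the chosen criterion is the one selected.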

Figure 2 shows $\alpha$, $\beta$, and the total error $(\alpha + \beta)/2$. The FBST outperforms all the 
regularization criteria. For small samples, BIC is very biased, always selecting the 1 
component model. AIC is the second best criterion, catching up with the FBST for sample 
sizes larger than $n = 150$. 

Figure 2: Criteria O = FBST, X = AIC, + = AIC3, * = BIC. 
Type 1, 2 and total error rates for different sample sizes 

Finally, let us point out a related topic for further research: The problem of 
discriminating between models consists of determining which of $m$ alternative models, $f_k(x, \psi_k)$, 
more adequately fits or describes a given dataset. In general the parameters $\psi_k$ have 
distinct dimensions, and the models $f_k$ have distinct (unrelated) functional forms. In this 
case it is usual to call them "separate" models (or hypotheses). Atkinson (1970), although 
in a very different theoretical framework, was the first to analyse this problem using a 
mixture formulation, 

$$ f(x \mid \theta) = \sum_{k=1}^{m} w_k f_k(x, \psi_k) . $$

The general theory for mixture models presented in this article can be adapted to 
analyse the problem of discriminating between separate hypotheses. This is the subject 
of the authors' ongoing research with Carlos Alberto de Bragança Pereira and Basilio de 
Bragança Pereira, to be presented in forthcoming articles. 

The authors are grateful for the support of CAPES - Coordenação de Aperfeiçoamento 
de Pessoal de Nível Superior, CNPq - Conselho Nacional de Desenvolvimento Científico 
e Tecnológico, and FAPESP - Fundação de Apoio à Pesquisa do Estado de São Paulo. 



C.7 REAL Classification Trees 

This section presents an overview of REAL, the Real Attribute Learning Algorithm for 
automatic construction of classification trees. The REAL project started as an application 
to be used at the Brazilian BOVESPA and BM&F financial markets, trying to provide a 
good algorithm for predicting the adequacy of operation strategies. In this context, the 
success or failure of a given operation strategy corresponds to different classes, and the 
attributes are real-valued technical indicators. The users' demands for a decision support 
tool also explain several of the algorithm's unique features. 

The classification problems are stated as an $n \times (m + 1)$ matrix $A$. Each row, $A(i, :)$, 
represents a different example, and each column, $A(:, j)$, a different attribute. The first 
$m$ columns in each row are real-valued attributes, and the last column, $A(i, m + 1)$, is 
the example's class. Part of these samples, the training set, is used by the algorithm to 
generate a classification tree, which is then tested with the remaining examples. The error 
rate in the classification of the examples in the test set is a simple way of evaluating the 
classification tree. 

A market operation strategy is a predefined set of rules determining an operator's 
actions in the market. The strategy shall have a predefined criterion for classifying a 
strategy application as success or failure. 

As a simple example, let us define the strategy buysell(t, d, l, u, c): 

• At time $t$ buy a given asset $A$, at its price $p(t)$. 

• Sell $A$ as soon as: 

1. $t' = t + d$ , or 

2. $p(t') = p(t) * (1 + u/100)$ , or 

3. $p(t') = p(t) * (1 - l/100)$ . 

• The strategy application is successful if $c \leq 100 * p(t')/p(t) \leq u$. 

The parameters $u$, $l$, $c$ and $d$ can be interpreted as the desired and worst accepted returns 
(upper and lower bound), the strategy application cost, and a time limit. 
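
A buy-and-sell rule of this kind can be simulated on a discretely sampled price series as in the sketch below. Several details are assumptions of this example: exit conditions are tested with inequalities (a sampled price may jump past a bound), the return is measured as the percent change 100 * (p(t')/p(t) - 1), and an application counts as successful when that return is at least the cost c.

```python
def buysell(prices, t, d, l, u, c):
    """Apply the strategy at time index t; return (exit_time, percent_return, success)."""
    p0 = prices[t]
    upper = p0 * (1.0 + u / 100.0)      # desired return reached
    lower = p0 * (1.0 - l / 100.0)      # worst accepted return reached
    last = min(t + d, len(prices) - 1)  # time limit (or end of the series)
    t_exit = last
    for s in range(t + 1, last + 1):
        if prices[s] >= upper or prices[s] <= lower:
            t_exit = s
            break
    ret = 100.0 * (prices[t_exit] / p0 - 1.0)
    return t_exit, ret, ret >= c
```

Labelling each time index of a historical series with the success flag gives the class column of the classification problem, with the technical indicators at that time as attributes.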

Tree Construction 

Each main iteration of the REAL algorithm corresponds to the branching of a terminal 
node in the tree. The examples at that node are classified according to the value of a 
selected attribute, and new branches are generated for each specific interval. The partition 
of a real-valued attribute's domain into adjacent non-overlapping (sub) intervals is the 
discretization process. Each main iteration of REAL includes: 

1. The discretization of each attribute, and its evaluation by a loss function. 

2. Selecting the best attribute, and branching the node accordingly. 

3. Merging adjacent intervals that fail to reach a minimum conviction threshold. 
C.7.1 Conviction, Loss and Discretization 

Given a node of class c with n examples, k of which are misclassified and (n — k) of 
which are correctly classified, we needed a single scalar parameter, cm, to measure the 
probability of misclassification and its confidence level. Such a simplified conviction (or 
trust) measure was a demand of REAL users operating at the stock market. 

Let $q$ be the misclassification probability for an example at a given node, let $p = (1 - q)$ 
be the probability of correct classification, and assume we have a Bayesian distribution 
for $q$, namely 

$$ D(c) = \Pr(q \leq c) = \Pr(p \geq 1 - c) $$
We define the conviction measure: $100 * (1 - cm)\%$, where 

$$ cm = \min c \mid \Pr(q \leq c) \geq 1 - g(c) $$

and $g(\ )$ is a monotonically increasing bijection of $[0, 1]$ onto itself. From our experience 
in the stock market application we learned to be extra cautious about making strong 
statements, so we make $g(\ )$ a convex function. 

In this paper $D(c)$ is the posterior distribution for a sample taken from the Bernoulli 
distribution, with a uniform prior for $q$: 

$$ B(n, k, q) = \mathrm{comb}(n, k) * q^k * p^{n-k} $$

$$ D(c, n, k) = \int_{q=0}^{c} B(n, k, q) \Big/ \int_{q=0}^{1} B(n, k, q) = \mathrm{betainc}(c, k + 1, n - k + 1) $$

Also in this paper, we focus our attention on 

$$ g(c) = g(c, r) = c^r , \quad r \geq 1.0 $$

we call $r$ the convexity parameter. 

With these choices, the posterior is the easily computed incomplete beta function, and 
$cm$ is the root of the monotonically decreasing function: 

$$ cm(n, k, r) = c \mid f(c) = 0 $$

$$ f(c) = 1 - g(c) - D(c, n, k) = 1 - c^r - \mathrm{betainc}(c, k + 1, n - k + 1) $$
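
The root can be found by bisection, since the function is decreasing from 1 at c = 0 to -1 at c = 1. The sketch below uses only the standard library: for integer n and k, the regularized incomplete beta function with parameters (k + 1, n - k + 1) equals the probability that a Binomial(n + 1, c) variate is at least k + 1, a standard identity. The function names `posterior_cdf` and `cm` are chosen for this example.

```python
from math import comb

def posterior_cdf(c, n, k):
    """D(c, n, k): posterior probability that q <= c, for integer n and k.

    Uses the identity betainc(c, a, b) = P(Binomial(a + b - 1, c) >= a)
    with a = k + 1 and b = n - k + 1.
    """
    N = n + 1
    return sum(comb(N, i) * c ** i * (1.0 - c) ** (N - i)
               for i in range(k + 1, N + 1))

def cm(n, k, r, tol=1e-12):
    """Root of f(c) = 1 - c**r - D(c, n, k) on [0, 1], found by bisection."""
    def f(c):
        return 1.0 - c ** r - posterior_cdf(c, n, k)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid        # f is decreasing, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As a check, an empty node (n = 0, k = 0) with r = 2 gives the root of 1 - c^2 - c = 0, that is, c = (sqrt(5) - 1)/2, and a larger uniform node gives a smaller cm.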

Finally, we want a loss function for the discretizations, based on the conviction 
measure. In this paper we use the overall sum of each example classification conviction, that 
is, the sum over all intervals of the interval's conviction measure times the number of 
examples in the interval. 

$$ loss = \sum_{i} n_i * cm_i $$

Given an attribute, the first step of the discretization procedure is to order the 
examples in the node by the attribute's value, and then to join together the neighboring 
examples of the same class. So, at the end of this first step, we have the best ordered 
discretization for the selected attribute with uniform class clusters. 

In the subsequent steps, we join intervals together, in order to decrease the overall 
loss function of the discretization. The gain of joining $J$ adjacent intervals, $I_{h+1}, I_{h+2}, 
\ldots, I_{h+J}$, is the relative decrease in the loss function 

$$ gain(h, J) = \sum_{j} loss(n_j, k_j, r) - loss(n, k, r) $$

where $n = \sum_j n_j$ and $k$ counts the minorities' examples in the new cluster (at the second 
step $k_j = 0$, because we begin with uniform class clusters). 

At each step we perform the cluster joining operation with maximum gain. The 
discretization procedure stops when there are no more joining operations with positive 
gain. 

The next examples show some clusters that would be joined together at the first step of 
the discretization procedure. The notation $(n, k, m, r, \pm)$ means that we have two uniform 
clusters of the same class, of size $n$ and $m$, separated by a uniform cluster of size $k$ of a 
different class; $r$ is the convexity parameter, and $+$ ($-$) means we would (not) join the 
clusters together. 

( 2,1, 2,2,+) 

( 6,2, 1,2,-) ( 6,2, 8,2,+) ( 6,2,23,2,+) ( 6,2,24,2,-) 

( 7,2, 6,2,-) ( 7,2, 7,2,+) ( 7,2,42,2,+) ( 7,2,43,2,-) 

(23,3,23,2,-) (23,3,43,2,-) (23,3,44,2,+) 

(11,3,13,3,-) (11,3,14,3,+) (11,3,39,3,+) (11,3,40,3,-) 

(12,3,12,3,-) (12,3,13,3,+) (12,3,54,3,+) (12,3,55,3,-) 

In these examples we see that it takes extreme clusters of a balanced and large enough 
size, n and m, to "absorb" the noise or impurity in the middle cluster of size k. A larger 
convexity parameter, r, implies a larger loss at small clusters, and therefore makes it 
easier for sparse impurities to be absorbed. 
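
The joining decision in the tabulated examples can be explored numerically. The sketch below is one reading of the definitions in this section: each cluster contributes its size times its conviction parameter cm, three uniform clusters of sizes n, k and m are compared with a single merged cluster of size n + k + m containing k minority examples, and the clusters are joined when the gain is positive. The function names are hypothetical, and cm is computed with the binomial-tail form of the incomplete beta function for integer arguments.

```python
from math import comb

def cm(n, k, r, iters=60):
    """Root of 1 - c**r - betainc(c, k+1, n-k+1) on [0, 1], by bisection."""
    def f(c):
        N = n + 1
        tail = sum(comb(N, i) * c ** i * (1.0 - c) ** (N - i)
                   for i in range(k + 1, N + 1))
        return 1.0 - c ** r - tail
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

def loss(n, k, r):
    """Loss of one cluster: number of examples times its cm value."""
    return n * cm(n, k, r)

def join_gain(n, k, m, r):
    """Decrease in loss when clusters of sizes n, k, m are merged into one."""
    separate = loss(n, 0, r) + loss(k, 0, r) + loss(m, 0, r)
    merged = loss(n + k + m, k, r)
    return separate - merged
```

Under this reading, `join_gain(2, 1, 2, 2)` is positive and `join_gain(6, 2, 1, 2)` is negative, in agreement with the signs listed for the cases (2,1,2,2,+) and (6,2,1,2,-).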

C.7.2 Branching and Merging 

For each terminal node in the tree, we 

1. perform the discretization procedure for each available attribute, 

2. measure the loss function of the final discretization, 

3. select the minimum loss attribute, and 

4. branch the node according to this attribute discretization. 

If no attribute discretization decreases the loss function by a numerical precision threshold 
$\epsilon > 0$, no branching takes place. 

A premature discretization by a parameter selected at a given level may preclude 
further improvement of the classification tree by the branching process. For this reason 
we establish a conviction threshold, $ct$, and after each branching step we merge all adjacent 
intervals that do not achieve $cm < ct$. To prevent an infinite loop, the loss function value 
assigned to the merged interval is the sum of the losses of the merging intervals. At the final 
leaves, this merging is undone. The conviction threshold naturally stops the branching 
process, so there is no need for an external pruning procedure, as in most TDIDT 
algorithms. 

In the straightforward implementation, REAL spends most of the execution time 
computing the function $cm(n, k, r)$. We can greatly accelerate the algorithm by using 
precomputed tables of $cm(n, k, r)$ values for small $n$, and precomputed tables of $cm(n, k, r)$ 
polynomial interpolation coefficients for larger $n$. To speed up the algorithm we can also 
restrict the search for join operations at the discretization step to small neighborhoods, 
i.e. to join only $3 \leq J \leq J_{max}$ clusters. Doing so will expedite the algorithm without 
any noticeable consistent degradation. 

For further details on the numerical implementation, benchmarks, and the specific 
market application, see Lauretto et al. (1998). 



Appendix D 

Deterministic Evolution and 
Optimization 

This chapter presents some methods of deterministic optimization. Section 1 presents the 
fundamentals of Linear Programming (LP), its duality theory, and some variations of the 
Simplex algorithm. Section 2 presents some basic facts of constrained and unconstrained 
Non-Linear Programming (NLP), the Generalized Reduced Gradient (GRG) algorithm 
for constrained NLP problems, the ParTan method for unconstrained NLP problems, 
and some simple line search algorithms for uni-dimensional problems. Sections 1 and 2 
also present some results about these algorithms' local and global convergence properties. 
Section 3 is a very short introduction to variational problems and the Euler-Lagrange 
equation. 

The algorithms presented in sections 1 and 2 are within the class of active set or active 
constraint algorithms. The choice of concentrating on this class is motivated by some 
properties of active set algorithms that make them especially useful in statistical 
applications, namely: 

- Active set algorithms maintain viability throughout the search path for the optimal 
solution. This is important if the objective function can only be computed at (nearly) 
feasible arguments, as is often the case in statistics or simulation problems. This feature 
also makes active set algorithms relatively easy to explain and implement. 

- The general convergence theory of active set algorithms and the analysis of specific 
problems may offer a constructive proof of the existence or the verification of stability 
conditions for an equilibrium or fixed point, representing a systemic eigen-solution, see, 
for example, Border (1989), Ingrao and Israel (1990) and Zangwill (1964). 

- Active set algorithms are particularly efficient for small or medium size re-optimization 
problems, that is, for optimization problems where the initial solution or starting point for 
the optimization procedure is (nearly) feasible and already close to the optimal solution, 
so that the optimization algorithm is only used to fine-tune the solution. In FBST 
applications, such good starting points can be obtained from an exploratory search performed by 
the Monte Carlo or Markov Chain Monte Carlo procedures used to numerically integrate 
the FBST e-value, ev(H), or truth function W(v), see appendices A and G. 

D.1 Convex Sets and Polyedra 

The matrix notation used in this book is defined at section F.1. 

Convex Sets 

A point $y(l)$ is a convex combination of $m$ points of $R^n$, given by the columns of matrix 
$X$, $n \times m$, iff 

$$ \forall i , \ y(l)_i = \sum_{j=1}^{m} l_j * X_i^j , \quad l_j \geq 0 \mid \sum_{j=1}^{m} l_j = 1 , $$
or, equivalently, in matrix notation, iff 

$$ y(l) = \sum_{j=1}^{m} l_j * X^j , \quad l_j \geq 0 \mid \sum_{j=1}^{m} l_j = 1 , $$
or, in yet another equivalent form, replacing the summations by inner products, 

$$ y(l) = X l , \quad l \geq 0 \mid \mathbf{1}'l = 1 . $$

In particular, the point $y(\lambda)$ is a convex combination of two points, $z$ and $w$, if 

$$ y(\lambda) = (1 - \lambda) z + \lambda w , \quad \lambda \in [0, 1] . $$

Geometrically, these are the points in the line segment from $z$ to $w$. 

A set, $C \subseteq R^n$, is convex iff it contains all convex combinations of any two of its points. 
A set, $C \subseteq R^n$, is bounded iff the distance between any two of its points is bounded: 

$$ \exists \delta \mid \forall x_1, x_2 \in C , \ \|x_1 - x_2\| \leq \delta $$

Figure D.1 presents some sets exemplifying the definitions above. 

An extreme point of a convex set $C$ is a point $x \in C$ that can not be represented as a 
convex combination of two other points of $C$. The profile of a convex set $C$, ext($C$), is the 
set of its extreme points. The convex hull and the closed convex hull of a set $C$, ch($C$) 
and cch($C$), are the intersection of all convex sets, and closed convex sets, containing $C$. 

Theorem: A compact (closed and bounded) convex set is equal to the closed convex 
hull of its profile, that is, $C = \mathrm{cch}(\mathrm{ext}(C))$. 

Figure D.1: (a) non-convex set, (b,c) bounded and unbounded polyedron, 
(d-f) degenerate vertex perturbed to a single or two nondegenerate ones. 



The epigraph of a curve in $R^2$, $y = f(x)$, $x \in [a, b]$, is the set defined as $\mathrm{epig}(f) = 
\{(x, y) \mid x \in [a, b] \wedge y \geq f(x)\}$. A curve is said to be convex iff its epigraph is convex. A 
curve is said to be concave iff $-f(x)$ is convex. 

Theorem: A curve, $y = f(x)$, $R \mapsto R$, that is continuously differentiable and has 
monotonically increasing first derivative is convex. 

Theorem: The convex hull of a finite set of points, $V$, is the set of all convex 
combinations of points of $V$, that is, if $V = \{x^i , i = 1 \ldots n\}$, then $\mathrm{ch}(V) = \{x \mid x = [x^1, \ldots x^n] l , \ l \geq 
0 , \ \mathbf{1}'l = 1\}$. 

A (non-linear) constraint, in $R^n$, is an inequality of the form $g(x) \leq 0$, $g : R^n \mapsto R$. 
The feasible region defined by $m$ constraints, $g(x) \leq 0$, $g : R^n \mapsto R^m$, is the set of feasible 
(or viable) points $\{x \mid g(x) \leq 0\}$. At the feasible point $x$, the constraint $g_i(x)$ is said to 
be active or tight if the equality, $g_i(x) = 0$, holds, and it is said to be inactive or slack if 
the strict inequality, $g_i(x) < 0$, holds. 

Polyedra 

A polyedron in $R^n$ is a feasible region defined by linear constraints: $Ax \leq d$. We can 
always compose an equality constraint, $a'x = \delta$, with two inequality constraints, $a'x \leq \delta$ 
and $a'x \geq \delta$. 

Theorem: Polyedra are convex, but not necessarily bounded. 

A face of dimension $k$, of a polyedron in $R^n$ with $m$ equality constraints, is a feasible 
region that obeys tightly $n - m - k$ of the polyedron's inequality constraints. 
Equivalently, a point that obeys $r$ active inequality constraints is at a face of dimension 
$k = n - m - r$. A vertex is a face of dimension 0. An edge is a face of dimension 1. 
An interior point of the polyedron has all inequality constraints slack or inactive, that is, 
$k = n - m$. A facet is a face of dimension $k = n - m - 1$. 

It is possible to have a point in a face of negative dimension. For example, Figure D.1 
shows a point where $n - m + 1$ inequality constraints are active. This point is "super 
determined", since it is a point in $R^n$ that obeys $n + 1$ equations, $m$ equality constraints 
and $n - m + 1$ active inequality constraints. Such a point is said to be degenerate. From 
now on we assume the non-degeneracy hypothesis, stating that such points do not 
exist in the optimization problem at hand. This hypothesis is very reasonable, since the 
slightest perturbation to a degenerate problem transforms a degenerate point into one or 
more vertices, see Figure D.1. 

A polyedron in standard form, $P_{A,d} \subset R^n$, is defined by $n$ sign constraints, $x \geq 0$, 
and $m < n$ equality constraints, that is, 

$$ P_{A,d} = \{x \geq 0 \mid Ax = d\} , \quad A \ m \times n $$

We can always rewrite a polyedron in standard form (at a higher dimensional space) 
using the following artifices: 

1. Replace an unconstrained variable, $x_i$, by the difference of two positive ones, $x_i^+ - x_i^-$, 
where $x_i^+ = \max\{0, x_i\}$ and $x_i^- = \max\{0, -x_i\}$. 

2. Add a slack variable, $\chi \geq 0$, to each inequality: 

$$ a'x \leq \delta \ \Leftrightarrow \ \begin{bmatrix} a' & 1 \end{bmatrix} \begin{bmatrix} x \\ \chi \end{bmatrix} = \delta . $$

From the definition of vertex we can see that, in a polyedron in standard form, $P_{A,d}$, 
a vertex is a feasible point where $n - m$ constraints are active. Hence, $n - m$ variables 
are null; these are the residual variables of this vertex. Let us permute the vector $x$ so 
as to place the residual variables at the last $n - m$ positions. Hence, the remaining 
(non-null) variables, the basic variables, will be at the first $m$ positions. Applying the same 
permutation to the columns of matrix $A$, the block of the first $m$ columns is called the 
basis, $B$, of this vertex, while the block of the remaining $n - m$ columns of $A$ is called the 
residual matrix, $R$. That is, given vectors $b$ and $r$ with the basic and residual indices, the 
permuted matrix $A$ can be partitioned as 

$$ \begin{bmatrix} A^b & A^r \end{bmatrix} = \begin{bmatrix} B & R \end{bmatrix} $$

In this form, it is easy to write the non-null variables explicitly, 

$$ \begin{bmatrix} B & R \end{bmatrix} \begin{bmatrix} x_b \\ x_r \end{bmatrix} = d \quad \text{hence,} \quad x_b = B^{-1} [d - R x_r] $$
Equating the residual variables to zero, it follows that 

$$ x_b = B^{-1} d . $$

From the definition of degeneracy we see that the vertex of a polyedron in standard 
form is degenerate iff it has a null basic variable. 



D.2 Linear Programming 

This section presents Linear Programming, the simplest optimization problem studied in 
multi-dimensional mathematical programming. The simple structure of LP allows the 
formal development of relatively simple solution algorithms, namely, the primal and dual 
simplex. This section also presents some decomposition techniques used for solving LP 
problems in special forms. 



D.2.1 Primal and Dual Simplex Algorithms 

A LP problem in standard form asks for the minimum of a linear function inside a 
polyedron in standard form, that is, 

$$ \min cx , \quad x \geq 0 \mid Ax = d . $$

Assume we know which are the residual (zero) variables of a given vertex. In this 
case we can form basic and residual index vectors, $b$ and $r$, and obtain the basic (non- 
zero) variables of this vertex. Permuting and partitioning all objects of the LP problem 
according to the order established by the basic and residual index vectors, the LP problem 
is written as 

$$ \min \begin{bmatrix} c^b & c^r \end{bmatrix} \begin{bmatrix} x_b \\ x_r \end{bmatrix} , \quad x \geq 0 \mid \begin{bmatrix} B & R \end{bmatrix} \begin{bmatrix} x_b \\ x_r \end{bmatrix} = d . $$

Using the notation 

$$ \tilde{d} = B^{-1} d \quad \text{and} \quad \tilde{R} = B^{-1} R , $$

the basic solution corresponding to this vertex is $x_b = \tilde{d}$ (and $x_r = 0$). 

Let us now proceed with an analysis of the sensitivity of this basic solution by a 
perturbation of a single residual variable. If we change a single residual variable, say the 
$j$-th element of $x_r$, allowing it to become positive, that is, making $x_{r(j)} > 0$, the basic 
solution, $x_b$, becomes 

$$ x_b = \tilde{d} - \tilde{R} x_r = \tilde{d} - x_{r(j)} \tilde{R}^j $$

This solution remains feasible as long as it remains non-negative. Using the 
non-degeneracy hypothesis, $\tilde{d} > 0$, and we know that it is possible to increase the value 
of $x_{r(j)}$, while keeping the basic solution feasible, up to a threshold $\epsilon > 0$, when some 
basic variable becomes null. 

The value of this perturbed basic solution is 

$$ cx = c^b x_b + c^r x_r $$
$$ = c^b B^{-1} [d - R x_r] + c^r x_r $$
$$ = c^b \tilde{d} + (c^r - c^b \tilde{R}) x_r $$
$$ \equiv \varphi - z x_r = \varphi - z^j x_{r(j)} $$

Vector $z$ is called the reduced cost of this basis. 

The sensitivity analysis suggests the following algorithm used to generate a sequence 
of vertices of decreasing values, starting from an initial vertex, $[x_b \mid x_r]$. 

Simplex Algorithm: 

1. Find a residual index $j$, such that $z^j > 0$. 

2. Compute, for $k \in K = \{l \mid \tilde{R}_l^j > 0\}$ , $\epsilon_k = \tilde{d}_k / \tilde{R}_k^j$ , 
and take $i \in \mathrm{Argmin}_{k \in K} \epsilon_k$ , i.e. $\epsilon(i) = \min_k \epsilon_k$ . 

3. Make the variable $x_{r(j)}$ basic, and $x_{b(i)}$ residual. 

4. Go back to step 1. 

The simplex can not proceed if $z \leq 0$ at the first step, or if, at the second step, the 
minimum is taken over the empty set. In the second case the LP problem is unbounded. 
In the first case the current vertex is an optimal solution! 
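
The steps of the Simplex algorithm can be sketched compactly with dense linear algebra. This is a teaching sketch only: it re-solves with the basis at every iteration instead of updating a factorization, always picks the residual variable with the largest reduced cost, and assumes a non-degenerate problem with a known initial basis. The function name `simplex` and the 0-based index convention are choices of this example.

```python
import numpy as np

def simplex(A, d, c, basis, max_iter=100, tol=1e-12):
    """Minimize c x subject to A x = d, x >= 0, starting from a feasible basis.

    basis : list of m column indices (0-based) forming the initial vertex.
    Returns (x, basis) at an optimal vertex.
    """
    m, n = A.shape
    b = list(basis)
    for _ in range(max_iter):
        r = [j for j in range(n) if j not in b]   # residual indices
        B, R = A[:, b], A[:, r]
        d_t = np.linalg.solve(B, d)               # basic solution, B^{-1} d
        R_t = np.linalg.solve(B, R)               # B^{-1} R
        z = c[b] @ R_t - c[r]                     # reduced cost
        if np.all(z <= tol):                      # no improving direction: optimal
            x = np.zeros(n)
            x[b] = d_t
            return x, b
        j = int(np.argmax(z))                     # entering residual variable
        col = R_t[:, j]
        K = np.where(col > tol)[0]
        if K.size == 0:
            raise ValueError("LP problem is unbounded")
        i = int(K[np.argmin(d_t[K] / col[K])])    # ratio test: leaving basic variable
        b[i] = r[j]                               # pivot
    raise RuntimeError("iteration limit reached")
```

For instance, minimizing $-x_1 - x_2$ over the unit square, written in standard form with two slack variables and started from the slack basis, ends at the vertex $x = [1, 1, 0, 0]$ with value $-2$.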
Changing the status basic / residual of a pair of variables is, in the LP jargon, to 
pivot. After each pivoting operation the basis inverse needs to be recomputed, that is, the 
basis needs to be reinverted. Numerically efficient implementations of the Simplex do not 
actually keep the basis inverse; instead, the basis inverse is represented by a numerical 
factorization, like $B = LU$ or $B = QR$. At each pivot operation the basis is changed by a 
single column, and there are efficient numerical algorithms used to update the numerical 
factorization representing the basis inverse, see Murtagh (1981) and Stern (1994). 

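Example 1: Let us illustrate the Simplex algorithm by solving the following simple 
example. 

Let us consider the LP problem $\min [-1, -1]x$, $0 \leq x \leq 1$. 
This problem can be restated in standard form: 

$$ \min \begin{bmatrix} -1 & -1 & 0 & 0 \end{bmatrix} x , \quad x \geq 0 \mid \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} x = \begin{bmatrix} 1 \\ 1 \end{bmatrix} $$

The initial vertex $x = [0, 0]$ is assumed to be known. 

Step 1: $r = [1, 2]$, $b = [3, 4]$, $B = A(:, b) = I$, $R = A(:, r) = I$, 

$-z = c^r - c^b \tilde{R} = [-1, -1] - [0, 0] \Rightarrow z = [1, 1]$, $j = 1$, $r(j) = 1$, 

$x_b = \tilde{d} - \epsilon \tilde{R}^j = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \epsilon \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, $\epsilon^* = 1$, $i = 1$, $b(i) = 3$. 

Step 2: $r = [3, 2]$, $b = [1, 4]$, $B = A(:, b) = I$, $R = A(:, r) = I$, 

$-z = c^r - c^b \tilde{R} = [0, -1] - [-1, 0] \Rightarrow z = [-1, 1]$, $j = 2$, $r(j) = 2$, 

$x_b = \tilde{d} - \epsilon \tilde{R}^j = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \epsilon \begin{bmatrix} 0 \\ 1 \end{bmatrix}$, $\epsilon^* = 1$, $i = 2$, $b(i) = 4$. 

Step 3: $r = [3, 4]$, $b = [1, 2]$, $B = A(:, b) = I$, $R = A(:, r) = I$, 

$-z = c^r - c^b \tilde{R} = [0, 0] - [-1, -1] \Rightarrow z = [-1, -1] < 0$. 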

Obtaining the initial vertex 

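In order to obtain an initial vertex, used to start the simplex, we can use the auxiliary 
LP problem, 

$$ \min \begin{bmatrix} 0 & \mathbf{1} \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} , \quad \begin{bmatrix} x \\ y \end{bmatrix} \geq 0 \mid \begin{bmatrix} A & \mathrm{diag}(\mathrm{sign}(d)) \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = d . $$

An initial vertex for the auxiliary problem is given by $[\, 0 \ \ \mathrm{abs}(d') \,]'$. If the auxiliary 
problem has an optimal solution of value zero, the optimal solution gives a feasible vertex 
for the original problem; if not, the original problem is unfeasible. 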
Duality 

Given an LP problem, called the primal LP problem, we define a second problem, the dual 
problem (of the primal problem). Duality theory establishes important relations between 
the solution of the primal LP and the solution of its dual. 

Given a LP problem in canonic form, 

$$ \min cx \mid x \geq 0 \wedge Ax \geq d , $$
its dual LP problem is defined as 

$$ \max y'd \mid y \geq 0 \wedge y'A \leq c . $$

The primal and dual problems in canonic form have an intuitive economic 
interpretation. The primal problem can be interpreted as the classic ration problem: $A_j^i$ is the 
quantity of nutrient of type $j$ found in one unit of food of type $i$, $c^i$ is the cost of 
one unit of food $i$, and $d_j$ the minimum daily need of nutrient $j$. The primal optimal 
solution is a nutritionally feasible ration of minimum cost. The dual problem can be 
interpreted as a manufacturer of synthetic nutrients, looking for the "market value" for 
its nutrients line. The manufacturer's income per synthetic ration is the objective function 
to be maximized. In order to keep its line of synthetic nutrients competitive, no natural 
food should provide nutrients cheaper than the corresponding synthetic mixture; these 
are the dual problem's constraints. The optimal prices for the synthetic nutrients, $y^*$, can 
also be interpreted as marginal prices, giving the differential price increment of food $i$ 
by differential increase of its content of nutrient $j$. The correctness of these interpretations 
is demonstrated by the duality properties discussed next. 

Lemma 1: The dual of the dual is the primal LP problem. 

Proof: Just observe that the dual of the primal LP in canonic form is equivalent to 

$$ \min -y'd \mid y \geq 0 \wedge -y'A \geq -c . $$

This problem is again in canonic form, and can be immediately dualized, yielding a 
problem equivalent to the original LP problem. 

Weak Duality Theorem: If $x$ and $y$ are, respectively, feasible solutions of the primal 
and dual problems, then there is a non-negative gap between their values as solutions of 
these problems, that is, 

$$ cx \geq y'd . $$

Proof: By feasibility, $Ax \geq d$ and $y \geq 0$. Hence, $y'Ax \geq y'd$. In the same way, $y'A \leq c$ 
and $x \geq 0$. Hence $y'Ax \leq cx$. Therefore, $cx \geq y'd$. QED. 
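
A small numerical instance may help to see the weak duality inequality at work. The data below form a made-up ration problem with two foods and two nutrients; the matrix, costs and requirements are invented for this illustration only.

```python
import numpy as np

# Hypothetical ration problem: rows are nutrients, columns are foods.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
d = np.array([8.0, 9.0])     # minimum daily need of each nutrient
c = np.array([3.0, 4.0])     # cost of one unit of each food

def primal_feasible(x, tol=1e-12):
    """x >= 0 and A x >= d."""
    return bool(np.all(x >= -tol) and np.all(A @ x >= d - tol))

def dual_feasible(y, tol=1e-12):
    """y >= 0 and y'A <= c."""
    return bool(np.all(y >= -tol) and np.all(y @ A <= c + tol))

def duality_gap(x, y):
    """c x - y'd, non-negative for any feasible pair."""
    return float(c @ x - y @ d)
```

For the feasible pair x = [4, 2], y = [0.5, 0.5] the gap is 20 - 8.5 = 11.5. For x = [3, 2], y = [1, 1] both are feasible and the gap is zero, so neither solution can be improved.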
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Corollary 1: If we have a pair of feasible solutions, x for the primal LP problem and 
y for its dual, and their values as primal and dual solutions coincide, that is, cx* = {y*yd, 
then both solutions are optimal. 

Corollary 2: If the primal LP problem is unbounded, its dual is unfeasible. 
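
The weak duality inequality and its first corollary can be checked numerically on a small
ration problem. The following is only an illustrative sketch: the data `A`, `c`, `d` and the
candidate solutions are hypothetical values chosen for this example, and the helper names
are not part of the original text.

```python
# Canonic primal:  min c x    s.t.  A x >= d,   x >= 0
# Canonic dual:    max y' d   s.t.  y' A <= c,  y >= 0
# Hypothetical data: 2 nutrients (rows of A) and 3 foods (columns of A).
A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 2.0]]
c = [3.0, 4.0, 2.0]   # cost per unit of each food
d = [4.0, 6.0]        # minimum daily need of each nutrient


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def primal_feasible(x):
    """x >= 0 and A x >= d."""
    return all(xi >= 0 for xi in x) and all(dot(row, x) >= dj for row, dj in zip(A, d))


def dual_feasible(y):
    """y >= 0 and y' A <= c (one constraint per column of A)."""
    columns = list(zip(*A))
    return all(yj >= 0 for yj in y) and all(dot(y, col) <= ci for col, ci in zip(columns, c))


# An arbitrary feasible pair: the gap c x - y' d is non-negative.
x, y = [2.0, 2.0, 0.0], [0.5, 0.5]
gap = dot(c, x) - dot(y, d)

# A feasible pair with coinciding values: by Corollary 1 both are optimal.
x_star, y_star = [2.0, 0.0, 2.0], [1.0, 1.0]
primal_value, dual_value = dot(c, x_star), dot(y_star, d)
```

For this data the first pair leaves a strictly positive gap, while the second pair closes it,
which certifies optimality without running any LP algorithm.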

As we could re-write any LP problem in standard form, we can re-write any LP problem
in canonical form. Hence, from Lemma 1, we know that the duality relation is defined
between pairs of LP problems, whatever the form in which they have been written.

Lemma 2: Given a primal in standard form,

$$ \min\ cx \ \mid\ x \geq 0 \ \wedge\ Ax = d \ , $$

its dual is

$$ \max\ y'd \ \mid\ y \in R^m \ \wedge\ y'A \leq c \ . $$

Theorem (Simplex proof of correctness):

We shall prove that the Simplex stops at an optimal vertex. At the Simplex halting
point we have $z = -(c^r - c^b B^{-1} R) \leq 0$. Let us consider $y' = c^b B^{-1}$ as a
candidate solution for the dual,

$$
\begin{aligned}
[\, c^b \ \ c^r \,] - y' [\, B \ \ R \,]
 &= [\, c^b \ \ c^r \,] - c^b B^{-1} [\, B \ \ R \,] \\
 &= [\, c^b \ \ c^r \,] - c^b [\, I \ \ B^{-1}R \,] \\
 &= [\, c^b \ \ c^r \,] - [\, c^b \ \ c^b B^{-1} R \,] \\
 &= [\, 0 \ \ -z \,] \geq 0 \ .
\end{aligned}
$$

Hence, $y \in R^m$ is a feasible dual solution. Moreover, its value (as a dual solution) is
$y'd = c^b B^{-1} d = c^b \tilde{d} = \varphi$, and, by Corollary 1, both solutions are optimal.

Theorem (Strong Duality): If the primal problem is feasible and bounded, so is its
dual. Moreover, the values of the primal and dual solutions coincide.

Proof: Constructive, by the Simplex algorithm.

Theorem (Complementary Slackness): Let $x$ and $y'$ be feasible solutions to a LP in
standard form and its dual. These solutions are optimal iff $w'x = 0$, where
$w' = (c - y'A)$. The vectors $x$ and $w$ represent the slackness in the inequality
constraints of the primal and dual LP problems. Since $x \geq 0$ and $w \geq 0$, the scalar
product $w'x$ is null iff each of its terms, $w_j x_j$, is null; or equivalently, if for each
slack inequality constraint in the primal, the corresponding inequality constraint in the
dual is tight, and vice-versa. Hence the name complementary slackness.

Proof: If the solutions are optimal, we could have obtained them by the Simplex
algorithm. As in the Simplex proof of correctness, we have

$$ (c - y'A)x = [\, 0 \ \ -z \,] \begin{bmatrix} x_b \\ 0 \end{bmatrix} = 0 \ . $$

If $(c - y'A)x = 0$, then $y'(Ax) = cx$, or $y'd = cx$,
and by the first corollary of the weak duality theorem, both solutions are optimal.
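
As a small numerical illustration of complementary slackness, the sketch below checks the
condition $w'x = 0$ for a standard form LP. The problem data and the candidate pair `x`, `y`
are hypothetical, chosen so that the arithmetic can be verified by hand.

```python
# Standard form primal:  min c x   s.t.  A x = d,  x >= 0
# Dual:                  max y' d  s.t.  y' A <= c  (y unconstrained in sign)
# Hypothetical data: x1 + x2 + x3 = 4 and x1 - x4 = 1.
A = [[1, 1, 1, 0],
     [1, 0, 0, -1]]
d = [4, 1]
c = [1, 2, 0, 0]

x = [1, 0, 3, 0]   # candidate primal solution
y = [0, 1]         # candidate dual solution


def dual_slack(A, c, y):
    """Slack of the dual constraints, w' = c - y'A."""
    return [cj - sum(yi * row[j] for yi, row in zip(y, A)) for j, cj in enumerate(c)]


w = dual_slack(A, c, y)

primal_ok = all(xj >= 0 for xj in x) and all(
    sum(a * xj for a, xj in zip(row, x)) == di for row, di in zip(A, d))
dual_ok = all(wj >= 0 for wj in w)

# Complementary slackness: every product w_j * x_j must vanish.
slackness = sum(wj * xj for wj, xj in zip(w, x))
primal_value = sum(cj * xj for cj, xj in zip(c, x))
dual_value = sum(yi * di for yi, di in zip(y, d))
```

Here the positive components of `x` meet tight dual constraints and the slack dual
constraints meet null components of `x`, so both values coincide.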

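General Form of Duality

The following are an LP problem of the most general form and its dual. An asterisk, *,
indicates an unconstrained sub-vector.

The general Primal LP problem,

$$
\min\ \begin{bmatrix} c_{\{1\}} & c_{\{2\}} & c_{\{3\}} \end{bmatrix}
\begin{bmatrix} x_{\{1\}} \\ x_{\{2\}} \\ x_{\{3\}} \end{bmatrix}
\ \Bigg| \
\begin{matrix} x_{\{1\}} \geq 0 \\ x_{\{2\}} \ * \\ x_{\{3\}} \leq 0 \end{matrix}
\ , \quad
\begin{bmatrix}
A_{\{1\}}^{\{1\}} & A_{\{1\}}^{\{2\}} & A_{\{1\}}^{\{3\}} \\
A_{\{2\}}^{\{1\}} & A_{\{2\}}^{\{2\}} & A_{\{2\}}^{\{3\}} \\
A_{\{3\}}^{\{1\}} & A_{\{3\}}^{\{2\}} & A_{\{3\}}^{\{3\}}
\end{bmatrix}
\begin{bmatrix} x_{\{1\}} \\ x_{\{2\}} \\ x_{\{3\}} \end{bmatrix}
\begin{matrix} \leq d_{\{1\}} \\ = d_{\{2\}} \\ \geq d_{\{3\}} \end{matrix}
$$

and its Dual LP problem:

$$
\max\ \begin{bmatrix} d_{\{1\}}' & d_{\{2\}}' & d_{\{3\}}' \end{bmatrix}
\begin{bmatrix} y_{\{1\}} \\ y_{\{2\}} \\ y_{\{3\}} \end{bmatrix}
\ \Bigg| \
\begin{matrix} y_{\{1\}} \leq 0 \\ y_{\{2\}} \ * \\ y_{\{3\}} \geq 0 \end{matrix}
\ , \quad
\begin{bmatrix}
A_{\{1\}}^{\{1\}} & A_{\{1\}}^{\{2\}} & A_{\{1\}}^{\{3\}} \\
A_{\{2\}}^{\{1\}} & A_{\{2\}}^{\{2\}} & A_{\{2\}}^{\{3\}} \\
A_{\{3\}}^{\{1\}} & A_{\{3\}}^{\{2\}} & A_{\{3\}}^{\{3\}}
\end{bmatrix}'
\begin{bmatrix} y_{\{1\}} \\ y_{\{2\}} \\ y_{\{3\}} \end{bmatrix}
\begin{matrix} \leq c_{\{1\}}' \\ = c_{\{2\}}' \\ \geq c_{\{3\}}' \end{matrix}
$$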

The following are some interesting special cases:

Primal: $\max\ cx \mid Ax \geq d \wedge x \geq 0$ ; Dual: $\min\ y'd \mid y'A \geq c \wedge y \leq 0$ .
Primal: $\min\ cx \mid Ax \leq d \wedge x \in R^n$ ; Dual: $\max\ y'd \mid y'A = c \wedge y \leq 0$ .

Dual Simplex Algorithm



The Dual Simplex algorithm is analogous to the standard Simplex, but it works carrying a
basis that is dual feasible, and works to achieve primal feasibility. The Dual Simplex is
very useful in several situations in which we solve a LP problem, and subsequently have
to alter some constraints, losing primal feasibility. We work with a standard LP program
and its dual,

$$ P: \ \min\ cx \ \mid\ x \geq 0 \ \wedge\ Ax = d \qquad \text{and} \qquad D: \ \max\ y'd \ \mid\ y'A \leq c \ . $$

In a dual feasible basis, $y' = c^b B^{-1}$ is a dual feasible solution, that is,

$$
\begin{aligned}
[\, c^b \ \ c^r \,] - y' [\, B \ \ R \,]
 &= [\, c^b \ \ c^r \,] - c^b B^{-1} [\, B \ \ R \,] \\
 &= [\, c^b \ \ c^r \,] - c^b [\, I \ \ B^{-1}R \,] \\
 &= [\, c^b \ \ c^r \,] - [\, c^b \ \ c^b B^{-1} R \,] \\
 &= [\, 0 \ \ -z \,] \geq 0 \ .
\end{aligned}
$$

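Now, let us rewrite the dual in a form that is analogous to the standard form, adding
slack variables and using a partition of the coefficient matrix, $[B, R]$, as follows:

$$
\begin{aligned}
&\max\ d'y \ \mid\ A'y \leq c' \ , \\
&\max\ d'y \ \mid\ \begin{bmatrix} B' \\ R' \end{bmatrix} y \leq
   \begin{bmatrix} c^{b\prime} \\ c^{r\prime} \end{bmatrix} , \\
&\max\ d'y \ \mid\ \begin{bmatrix} B' & I & 0 \\ R' & 0 & I \end{bmatrix}
   \begin{bmatrix} y \\ w_b \\ w_r \end{bmatrix} =
   \begin{bmatrix} c^{b\prime} \\ c^{r\prime} \end{bmatrix} , \ w \geq 0 \ .
\end{aligned}
$$

In this form, the dual basis, its inverse, and the corresponding basic solution are given by
(where $B^{-t}$ denotes the inverse of $B'$)

$$
\begin{bmatrix} B' & 0 \\ R' & I \end{bmatrix} , \quad
\begin{bmatrix} B' & 0 \\ R' & I \end{bmatrix}^{-1} =
\begin{bmatrix} B^{-t} & 0 \\ -R'B^{-t} & I \end{bmatrix} ,
$$

$$
\begin{bmatrix} y \\ w_r \end{bmatrix} =
\begin{bmatrix} B^{-t} & 0 \\ -R'B^{-t} & I \end{bmatrix}
\left( \begin{bmatrix} c^{b\prime} \\ c^{r\prime} \end{bmatrix} -
\begin{bmatrix} I \\ 0 \end{bmatrix} w_b \right) , \ \text{i.e.}
$$

$$
y = B^{-t} ( c^{b\prime} - w_b ) \ , \quad
w_r = c^{r\prime} - R'B^{-t} c^{b\prime} + R'B^{-t} w_b \ .
$$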

Note that the indices in b and r correspond to basic and residual indices in the primal, 
the situation being reversed in the dual. As in the standard Simplex, we can increase a 
zero element of the residual vector in order to have a better dual solution. 



$$ d'y = d'B^{-t}(c^{b\prime} - w_b) = \mathrm{const} - \tilde{d}'w_b \ , \quad \tilde{d} = B^{-1}d \ . $$

If $\tilde{d} \geq 0$ the primal basic solution is feasible, and we have the optimal solutions
for both the primal and the dual problems. If there is an element $\tilde{d}_i < 0$, we can
increase the value of the dual solution by increasing $w_{b(i)}$. We can increase
$w_{b(i)} = \nu$, without losing dual feasibility, as long as we maintain

$$ w_r = c^{r\prime} - R'B^{-t}c^{b\prime} + \nu R'B^{-t} I^i \geq 0 \ , \ \text{transposing,} $$

$$ c^r - c^b B^{-1} R + \nu B_i^{-1} R \geq 0 \ , \ \text{i.e.} $$

$$ -z + \nu \tilde{R}_i \geq 0 \ , $$

where $I^i$ is the $i$-th column of the identity matrix, $B_i^{-1}$ is the $i$-th row of
$B^{-1}$, and $\tilde{R} = B^{-1}R$. Making
$j = \arg\min \{ \nu(j) = z^j / \tilde{R}_i^j \ , \ j \mid \tilde{R}_i^j < 0 \}$, we have the
index that leaves the dual basis. Hence, in the new list of indices $b$, that are primal
basic, we can exclude $b(i)$, include $r(j)$, update the basis' inverse and proceed to a new
dual simplex iteration, until we reach dual optimality or, equivalently, primal feasibility.

D.2.2 Decomposition Methods 

Suppose we have a LP problem in the form

$$
\min\ cx \ , \ x \geq 0 \ \mid\ Ax = b \ , \ \text{where} \
A = \begin{bmatrix} \dot{A} \\ \ddot{A} \end{bmatrix} , \
b = \begin{bmatrix} \dot{b} \\ \ddot{b} \end{bmatrix} ,
$$

and the polyhedron described by $\ddot{A}x = \ddot{b}$ has a very "simple" structure, while
$\dot{A}x = \dot{b}$ implies only a "few" additional constraints that, unfortunately, greatly
complicate the problem.

For example, let $\ddot{A}x = \ddot{b}$ describe a set of separate LP problems, while
$\dot{A}x = \dot{b}$ imposes global constraints coupling the variables of the several LP
problems. This structure is known as Row Block Angular Form (RBAF), see section 5.2.

We now study the Dantzig-Wolfe method, which allows us to solve the original LP problem
by successive iterations between a "small" main or master problem, and a large but
"simple" subproblem or slave problem. Here $\ddot{A}x = \ddot{b}$ denotes the "simple"
constraints and $\dot{A}x = \dot{b}$ the "few" coupling constraints. We assume that the simple
polyhedron is bounded, hence being the convex hull of its vertices,

$$ X = \{ x \geq 0 \mid \ddot{A}x = \ddot{b} \} = ch(V) = \{ Vl \ , \ l \geq 0 \mid \mathbf{1}'l = 1 \} \ . $$

The original LP problem is equivalent to the following master problem:

$$
M: \ \min\ cVl \ , \ l \geq 0 \ \mid\
\begin{bmatrix} \dot{A}V \\ \mathbf{1}' \end{bmatrix} l =
\begin{bmatrix} \dot{b} \\ 1 \end{bmatrix} .
$$

Obviously, this representation has only theoretical interest, for it is not practical to
find the many vertices in $V$. A given basis $B$ is optimal iff

$$ -z = [cV]^r - ([cV]^b B^{-1}) R \equiv [cV]^r - [y, \gamma] R \geq 0 \ . $$

This condition is equivalent to having, for any residual index, $j$,

$$ cV^j - [y, \gamma] \begin{bmatrix} \dot{A}V^j \\ 1 \end{bmatrix} \geq 0 \ , \ \text{or} $$

$$ \gamma \leq cV^j - y\dot{A}V^j = (c - y\dot{A})V^j \ , \ \text{or} $$

$$ \gamma \leq \min\ (c - y\dot{A})v \ , \ v \in X \ . $$

Hence, we define the sub-problem

$$ S: \ \min\ (c - y\dot{A})v \ , \ v \geq 0 \ \mid\ \ddot{A}v = \ddot{b} \ . $$

If the optimal solution of $S$, $v^*$, has optimal value $(c - y\dot{A})v^* \geq \gamma$, the
basis $B$ is optimal for $M$. If not, $v^*$ gives us the next column for entering the basis,
$\begin{bmatrix} \dot{A}v^* \\ 1 \end{bmatrix}$.

The optimal solution of the auxiliary problem also gives us a lower bound for the original
problem. Let $x$ be any feasible solution for the original problem, that is,
$x \in X \mid \dot{A}x = \dot{b}$. Since $x$ is more constrained,
$(c - y\dot{A})x \geq (c - y\dot{A})v^*$, hence, $cx \geq y\dot{b} + (c - y\dot{A})v^*$. Note
that $y\dot{b} + \gamma$, the value of the current master solution, is the current upper
bound. Also note that it is not necessary to have a monotonic increase in the lower bound.
Hence we must keep track of the best lower bound found so far.

As we have seen, the Dantzig-Wolfe method works very well for LP problems in RBAF. If
we had a problem in CBAF, Column Block Angular Form, we could use the Dantzig-Wolfe
decomposition method on the problem's dual. This is essentially Benders' decomposition
method, which can be efficiently implemented using the Dual Simplex algorithm.



Exercises 

1. Geometry and simple lemmas:

a- Draw the simplex, $S_n$, and the cube, $C_n$, of dimension 2 and 3, where
$S = \{ x \geq 0 \mid \mathbf{1}'x \leq 1 \}$ and $C = \{ x \geq 0 \mid Ix \leq \mathbf{1} \}$.

b- Rewrite $S_2$, $S_3$, $C_2$ and $C_3$ as standard form polyhedra in $R^n$, where
$n = 3, 4, 4, 6$, respectively.

c- Prove that a polyhedron (in standard form) is convex.
d- Prove duality lemmas 1 and 2.

e- Prove that a bounded polyhedron is the set of convex combinations of its vertices.

2. Write a program to solve a LP problem in standard form by exhaustive enumeration
of its vertices. Suggestion: Write a function to enumerate all arrangements,
$b = [b(1), \ldots, b(m)]$, of $m$ indices from $1:n$, in increasing order, that is,
$b(j) > b(i)$ for $j > i$. For each arrangement, $b$, form the basis $B = A^b$. Check if $B$
is invertible and, if so, check if the basic solution is a vertex, that is, if it is
feasible, $\tilde{d} = B^{-1}d \geq 0$. Compute the value of all feasible basic solutions,
and select the best one.

3. Adapt the Simplex algorithm to use the QR factorization of the basis. Explain how
to update the factorization after a pivoting operation.

4. Adapt and implement the Simplex for LP problems with box constraints, that is,

$$ \min\ cx \ , \ l \leq x \leq u \ \mid\ Ax = d \ . $$

Hint: Consider a given feasible basis, $B$, and a partition $[\, B \ R \ S \,]$, where
$l_b < x_b < u_b$, $x_r = l_r$, $x_s = u_s$, so that

$$ x_b = B^{-1}d - B^{-1}Rx_r - B^{-1}Sx_s \quad \text{and} $$

$$ cx = c^bB^{-1}d + (c^r - c^bB^{-1}R)x_r + (c^s - c^bB^{-1}S)x_s \equiv \varphi + z^r x_r + z^s x_s \ . $$

If $z^{r(k)} < 0$, we can improve the current solution by increasing this residual variable,
currently at its lower bound, $x_{r(k)} = l_{r(k)} + \delta_{r(k)}$, making

$$ x_b = B^{-1}d - B^{-1}Rl_r - B^{-1}Su_s - \delta_{r(k)} B^{-1}R^k \ . $$

However, $\delta_{r(k)}$ shall respect the following bounds:

1- $x_{r(k)} = l_{r(k)} + \delta_{r(k)} \leq u_{r(k)}$, 2- $x_b \geq l_b$, 3- $x_b \leq u_b$.

In a similar way, if $z^{s(k)} > 0$, we can improve the current solution by decreasing this
residual variable, currently at its upper bound, $x_{s(k)} = u_{s(k)} - \delta_{s(k)}$.

5. Adapt and implement the Dual Simplex for LP problems with box constraints.

6. Implement Dantzig-Wolfe decomposition methods for RBAF problems.
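
The enumeration suggested in Exercise 2 can be sketched as follows. This is only one
possible outline, not a reference solution: it uses exact rational arithmetic from the
Python standard library, indices start at 0 instead of 1, and the function names and the
small test problem are hypothetical choices made for this illustration.

```python
from fractions import Fraction
from itertools import combinations


def solve(B, d):
    """Solve B t = d by Gauss-Jordan elimination; return None if B is singular."""
    m = len(B)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(B, d)]
    for col in range(m):
        pivot = next((r for r in range(col, m) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(m):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * q for a, q in zip(M[r], M[col])]
    return [M[r][m] for r in range(m)]


def enumerate_vertices(A, d, c):
    """Best vertex of  min cx | x >= 0, Ax = d,  as (value, x); None if no vertex."""
    m, n = len(A), len(A[0])
    best = None
    for b in combinations(range(n), m):          # increasing index lists
        B = [[A[i][j] for j in b] for i in range(m)]
        t = solve(B, d)                          # basic solution, B^{-1} d
        if t is None or any(v < 0 for v in t):   # singular basis or infeasible
            continue
        x = [Fraction(0)] * n
        for j, v in zip(b, t):
            x[j] = v
        value = sum(Fraction(cj) * xj for cj, xj in zip(c, x))
        if best is None or value < best[0]:
            best = (value, x)
    return best


# Hypothetical example: min x1 + 2 x2 | x1 + x2 + x3 = 4, x1 - x4 = 1, x >= 0.
A = [[1, 1, 1, 0],
     [1, 0, 0, -1]]
d = [4, 1]
c = [1, 2, 0, 0]
best_value, best_x = enumerate_vertices(A, d, c)
```

The number of candidate bases grows combinatorially with $n$ and $m$, so this approach is
only useful for very small problems or for checking other implementations.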

D.3 Non-Linear Programming 

Optimality and Lagrange Multipliers 

We start this section giving an intuitive explanation of Lagrange's optimality conditions
for a Non-Linear Programming (NLP) problem, given as

$$ \min\ f(x) \ , \ x \ \mid\ g(x) \leq 0 \ \wedge\ h(x) = 0 \ , \quad f: R^n \to R \ , \ g: R^n \to R^m \ , \ h: R^n \to R^k \ . $$

We can imagine the function $f$ as a potential, or the "height" of a surface. An
equipotential is a manifold where the function is constant, $f(x) = c$. The gradient

$$ \nabla f = \partial f / \partial x = [\, \partial f/\partial x_1 , \ \partial f/\partial x_2 , \ \ldots \ \partial f/\partial x_n \,] $$

gives the steepest ascent direction of the function at point $x$. Hence, the gradient
$\nabla f(x)$ is orthogonal to the equipotential at this point.

Imagine a particle being "pulled down" by the force $-\nabla f(x)$. The optimal solution
must be a point of equilibrium for the particle. Hence, either the force pulling the particle
down is null, or else the force must be equilibrated by "reaction" forces exerted by the
constraints. The reaction force exerted by an inequality constraint, $g_i(x) \leq 0$, must
obey the following conditions:

a) Be a force orthogonal to the equipotential curve of this constraint (since only the
value of $g_i(x)$ is relevant for this constraint);

b) Be a force pulling the particle "inwards", that is, to the inside of the feasible region;

c) Moreover, an inequality constraint can only exert a reaction force if it is tight,
otherwise there is a slack allowing the particle to move even closer to this constraint.

An equality constraint, $h_i(x) = 0$, can be seen as a pair of inequality constraints,
$h_i(x) \leq 0$ and $h_i(x) \geq 0$, but unlike an inequality constraint, an equality
constraint is always active.

Our intuitive discussion can be summarized analytically by the following conditions,
known as Lagrange's optimality conditions:

If $x^* \in R^n$ is an optimal point, then

$$ \exists\, u \in R^m , \ v \in R^k \ \mid\ u \nabla g(x^*) + v \nabla h(x^*) - \nabla f(x^*) = 0 \ , \ \text{where} \ u \leq 0 \ \wedge\ u\, g(x^*) = 0 \ . $$

The condition $u \leq 0$ implies that the inequality's reaction force points to the inside of
the feasible region, while the complementarity condition, $ug = 0$, implies that only active
constraints can exert reaction forces. The vectors $u$ and $v$ are known as Lagrange
multipliers.
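
These conditions can be verified by hand on a tiny problem. The sketch below assumes a
hypothetical example, minimizing $f(x) = (x_1 - 2)^2 + (x_2 - 1)^2$ subject to the single
inequality constraint $g(x) = x_1 + x_2 - 2 \leq 0$, with no equality constraints; the
function names are illustrative only.

```python
def grad_f(x):
    """Gradient of f(x) = (x1 - 2)^2 + (x2 - 1)^2."""
    return [2.0 * (x[0] - 2.0), 2.0 * (x[1] - 1.0)]


def g(x):
    """Inequality constraint g(x) = x1 + x2 - 2 <= 0."""
    return x[0] + x[1] - 2.0


def grad_g(x):
    return [1.0, 1.0]


def lagrange_residual(x, u):
    """Residual of the equilibrium equation, u * grad g(x) - grad f(x)."""
    return [u * gi - fi for gi, fi in zip(grad_g(x), grad_f(x))]


# Candidate optimum: the projection of the unconstrained minimum (2, 1)
# onto the line x1 + x2 = 2, with multiplier u = -1.
x_star, u = [1.5, 0.5], -1.0
residual = lagrange_residual(x_star, u)      # equilibrium of forces
complementarity = u * g(x_star)              # u g(x*) = 0

# At another feasible point the same forces are not in equilibrium.
residual_elsewhere = lagrange_residual([1.0, 1.0], u)
```

At `x_star` the constraint is tight and its reaction force, with $u \leq 0$, exactly
balances $-\nabla f$; at the other point no such balance holds.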

Quadratic Programming 

As an example, let us derive the Lagrange optimality conditions for Quadratic
Programming (QP). QP is an important problem in its own right, and is also frequently used
as a subproblem in methods designed to solve more general problems like, for example,
Sequential Quadratic Programming, see Luenberger (1984) and Minoux and Vajda (1986).

The QP problem with linear constraints is stated as

$$ \min\ f(x) = (1/2)x'Qx - \eta p'x \ \mid\ x \geq 0 \ \wedge\ Te * x = te \ \wedge\ Tl * x \leq tl \ , $$

where the matrix dimensions are $Te$, $me \times n$, $me < n$, and $Tl$, $ml \times n$,
$ml < n$, and the index sets are $Ml = \{1, 2, \ldots ml\}$, $Me = \{1, 2, \ldots me\}$,
$N = \{1, 2, \ldots n\}$. We assume that the quadratic form defining the problem is
symmetric and positive definite, that is, $Q = Q'$, $Q > 0$.

In the QP problem above, the objective function's gradient is

$$ \nabla f = x'Q - \eta p' \ , $$

and the gradients of the constraint functions are

$$ g_i(x) = Tl_i\, x \leq tl_i \ \Rightarrow\ \nabla g_i = Tl_i \ . $$

Hence, the Lagrange optimality conditions are

$$ x \in R^n_+ , \ s \in R^n_+ , \ l \in R^{ml}_+ , \ e \in R^{me} \ \mid\ -(x'Q - \eta p') + s' - l'Tl - e'Te = 0 $$

$$ \wedge\ \forall i \in N , \ x_i s_i = 0 \ \wedge\ \forall k \in Ml , \ (Tl * x - tl)_k\, l_k = 0 \ , \ \text{or} $$

$$ x \in R^n_+ , \ s \in R^n_+ , \ l \in R^{ml}_+ , \ e \in R^{me} , \ yl \in R^{ml}_+ \ \mid\ Qx - s + Tl' * l + Te'e = \eta p $$

$$ \wedge\ \forall i \in N , \ x_i s_i = 0 \ \wedge\ \forall k \in Ml , \ yl_k\, l_k = 0 \ , \ \text{where} \ yl = (tl - Tl * x) \ . $$

The Complementarity Conditions (CC), $x's = 0 \wedge yl'l = 0$, indicate that only active
constraints can help to equilibrate non-negative components of the objective function's
gradient. Using the change of variables $e = ep - en$, $ep, en \geq 0$, the optimal solution
is characterized by the optimality and feasibility conditions
characterized by the optimality and feasibility conditions 



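$$
\begin{bmatrix} x \\ l \\ ep \\ en \\ s \\ yl \end{bmatrix} \geq 0 \ , \quad
\begin{bmatrix}
Tl & 0 & 0 & 0 & 0 & I \\
Te & 0 & 0 & 0 & 0 & 0 \\
Q & Tl' & Te' & -Te' & -I & 0
\end{bmatrix}
\begin{bmatrix} x \\ l \\ ep \\ en \\ s \\ yl \end{bmatrix}
=
\begin{bmatrix} tl \\ te \\ \eta p \end{bmatrix} , \quad
x's = 0 \ , \ yl'l = 0 \ .
$$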



Observe the formal resemblance of the last feasibility and optimality conditions to 
a LP in standard form where the non-linearity of the problem is encapsulated in the 
complementarity conditions. These observations are the key to adapt the Simplex to solve 
a QP, see Stern et al. (2006) and Wolfe (1959). This approach leads to efficient algorithms 
for Parametric Quadratic Programming and the computation of Efficient Frontiers, see 
Alexander and Francis (1986) and Markowitz (1952, 1956, 1987). 



D.3.1 GRG: Generalized Reduced Gradient 

Let us consider a NLP problem with non-linear equality constraints, plus box constraints
over the variables' range,

$$ \min\ f(x) \ , \ f: R^n \to R \ , \quad l \leq x \leq u \ \mid\ h(x) = 0 \ , \ h: R^n \to R^m \ . $$

The Generalized Reduced Gradient (GRG) method emulates the behaviour of the
Simplex method, for a local linearization of the NLP problem, see Abadie and Carpentier
(1969) and Minoux and Vajda (1986); for an intuitive presentation see Himmelblau (1972).
Let $x$ be an initial feasible point. As for LP, we assume a non-degeneracy hypothesis,
that is, we assume that, at a given feasible point, a maximum of $(n - m)$ box constraints
can be active. Hence, we can take $m$ of the variables with slack box constraints as basic
(or dependent) variables, and the remaining $n - m$ variables as residual (or independent)
variables. As in the Simplex algorithm, we permute and partition all vector and matrix
objects to better display this distinction,



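$$
x = \begin{bmatrix} x_b \\ x_r \end{bmatrix} , \quad
l = \begin{bmatrix} l_b \\ l_r \end{bmatrix} , \quad
u = \begin{bmatrix} u_b \\ u_r \end{bmatrix} , \quad
\nabla f(x) = [\, \nabla^b f(x) \ \ \nabla^r f(x) \,] \ ,
$$

$$
J(x) = [\, J^b(x) \ \ J^r(x) \,] =
\begin{bmatrix}
\nabla^b h_1(x) & \nabla^r h_1(x) \\
\nabla^b h_2(x) & \nabla^r h_2(x) \\
\vdots & \vdots \\
\nabla^b h_m(x) & \nabla^r h_m(x)
\end{bmatrix} .
$$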


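Let us consider the effect of a small alteration to the current feasible point, $x + \delta$,
assuming that the functions $f$ and $h$ are continuous and differentiable. The corresponding
alteration to the solution's value is

$$
\Delta f = f(x + \delta) - f(x) \approx \nabla f(x)\, \delta =
[\, \nabla^b f(x) \ \ \nabla^r f(x) \,] \begin{bmatrix} \delta_b \\ \delta_r \end{bmatrix} .
$$

We also want the altered solution, $x + \delta$, to remain (approximately) feasible, that is,

$$
\Delta h = h(x + \delta) - h(x) \approx J(x)\, \delta =
[\, J^b(x) \ \ J^r(x) \,] \begin{bmatrix} \delta_b \\ \delta_r \end{bmatrix} = 0 \ .
$$

Isolating $\delta_b$, and assuming that the basis $J^b(x)$ is invertible,

$$
\begin{aligned}
\delta_b &= - \left( J^b(x) \right)^{-1} J^r(x)\, \delta_r \ , \\
\Delta f &\approx \nabla^b f(x)\, \delta_b + \nabla^r f(x)\, \delta_r \\
 &= \left( \nabla^r f(x) - \nabla^b f(x) \left( J^b(x) \right)^{-1} J^r(x) \right) \delta_r = z(x)\, \delta_r \ .
\end{aligned}
$$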

Since the problem is non-linear, we can not assure that an optimal solution has all
residual variables with one active constraint, that is, at one side of the box, as in a
standard LP problem. Therefore, there is no motivation to restrict $\delta_r$ to have only
one non-zero component, as in the Simplex. Instead, we suggest moving the current solution
(in the space of residual variables) along the direction given by the vector $v_r$, opposed
to the reduced gradient, as long as the corresponding box constraint is slack, that is,

$$
v_{r(i)} = \begin{cases}
-z^i & \text{if } z^i > 0 \text{ and } x_{r(i)} > l_{r(i)} \\
-z^i & \text{if } z^i < 0 \text{ and } x_{r(i)} < u_{r(i)} \\
0 & \text{otherwise}
\end{cases}
$$

In subsection D.3.4 we shall give general conditions that assure global convergence
for NLP algorithms, and we will see that the discontinuity of vector $v_r$ as a function of
the box constraints' slacks is undesirable. Hence, we shall use a continuous version of the
search direction like, for example,

$$
v_{r(i)} = \begin{cases}
-\gamma(x_{r(i)} - l_{r(i)})\, z^i & \text{if } z^i > 0 \text{ and } x_{r(i)} > l_{r(i)} \\
-\gamma(u_{r(i)} - x_{r(i)})\, z^i & \text{if } z^i < 0 \text{ and } x_{r(i)} < u_{r(i)} \\
0 & \text{otherwise}
\end{cases}
$$

where $\gamma(x) = x/\epsilon$, if $0 \leq x \leq \epsilon$; and $\gamma(x) = 1$, otherwise.
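
The continuous search direction can be written down almost literally. The sketch below is
an illustration only: the function names, the default value of `eps` and the numerical
data are assumptions made for this example, and the reduced gradient `z` is taken as given.

```python
def gamma(slack, eps):
    """Damping factor: slack / eps on [0, eps], and 1 otherwise."""
    return slack / eps if 0.0 <= slack <= eps else 1.0


def search_direction(z, x_r, l_r, u_r, eps=1e-3):
    """Continuous direction v_r, opposed to the reduced gradient z."""
    v = []
    for z_i, x_i, l_i, u_i in zip(z, x_r, l_r, u_r):
        if z_i > 0 and x_i > l_i:
            # Decreasing x_i improves f; damp the move near the lower bound.
            v.append(-gamma(x_i - l_i, eps) * z_i)
        elif z_i < 0 and x_i < u_i:
            # Increasing x_i improves f; damp the move near the upper bound.
            v.append(-gamma(u_i - x_i, eps) * z_i)
        else:
            v.append(0.0)
    return v


# Hypothetical residual variables in the box [0, 1], with eps = 0.5.
z = [2.0, -1.0, 3.0, -4.0]
x_r = [0.5, 0.5, 0.0, 0.75]
v_r = search_direction(z, x_r, [0.0] * 4, [1.0] * 4, eps=0.5)
```

In this example the third variable sits at its lower bound and cannot move, while the
fourth is within `eps` of its upper bound, so its component is scaled down by one half.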

The basic idea of one iteration of the GRG method is to move the feasible point by a
step $x + \delta$ with $\delta = \eta v$, where $v_b = -\left( J^b(x) \right)^{-1} J^r(x)\, v_r$,
that is, a step (in the space of residual variables) of size $\eta$ in the direction $v_r$.
In order to determine the step size, $\eta$, we need to perform a line search, always
respecting the box constraints.

Note that the direction, in the space of basic variables, $v_b$, has been chosen so that
$x + \eta v$ remains (approximately) feasible, since we are moving inside a hyperplane that
is tangent to the algebraic manifold defined by $h(x) = 0$. The new nearly feasible point,
$x$, shall then receive a correction $\Delta x$ in order to regain exact feasibility for the
non-linear constraints, that is, so that $h(x + \Delta x) \approx 0$. The nearly feasible
point $x$ can be used as the starting point for a recursive method used to get exact
feasibility, like the Newton-Raphson method, that uses the basic Jacobian, $J^b(x)$, to
compute the correction

$$ \Delta x_b = - \left( J^b(x) \right)^{-1} h(x_b, x_r) \ . $$

D.3.2 Line Search and Local Convergence 

This section analyses the problem of minimizing a unidimensional function, $f(x)$. First,
let us consider the problem of finding the root (zero) of a differentiable function,
approximated by its first order Taylor expansion, $g(x) \approx g(x^k) + g'(x^k)(x - x^k)$.
This approximation implies that $g(x^{k+1}) \approx 0$, where

$$ x^{k+1} = x^k - g'(x^k)^{-1} g(x^k) \ . $$

This is Newton's method, used to find the root of a unidimensional function.

If a function $f(x)$ is differentiable, its minimum is at a point where the function's first
derivative is null. Hence, we can use Newton's method for minimizing $f(x)$,

$$ x^{k+1} = x^k - f''(x^k)^{-1} f'(x^k) \ . $$

Let us examine how fast the sequence generated by Newton's method approaches
the optimal solution, $x^*$, assuming the starting point, $x^0$, is already close enough to
$x^*$. Assuming third order differentiability, we can write

$$ 0 = f'(x^*) = f'(x^k) + f''(x^k)(x^* - x^k) + (1/2) f'''(y^k)(x^* - x^k)^2 \ , \ \text{or} $$

$$ x^* = x^k - f''(x^k)^{-1} f'(x^k) - (1/2) f''(x^k)^{-1} f'''(y^k) (x^* - x^k)^2 \ . $$

Subtracting the equation that defines Newton's method, we have

$$ (x^{k+1} - x^*) = (1/2) f''(x^k)^{-1} f'''(y^k) (x^k - x^*)^2 \ . $$

As we shall see in the following, this result implies that Newton's method converges 
very fast (quadratically), if we are already close enough to the optimal solution. However, 
Newton's method needs a lot of differential information about the function, something 
that may be hard to obtain. Moreover, far from the optimum, one can not be sure about 
the method's convergence. The following methods overcome these difficulties. 
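
A minimal sketch of Newton's method for unidimensional minimization follows. The test
function, $f(x) = x - \log(x)$, with minimum at $x^* = 1$, is a hypothetical choice for
which the Newton iteration reduces to $x^{k+1} = 2x^k - (x^k)^2$, so that the error is
exactly squared at each step; the function name and tolerances are assumptions.

```python
def newton_minimize(df, d2f, x, tol=1e-12, max_iter=50):
    """Iterate x <- x - f'(x) / f''(x) until the step is smaller than tol."""
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x


# f(x) = x - log(x):  f'(x) = 1 - 1/x,  f''(x) = 1/x^2,  minimum at x* = 1.
df = lambda x: 1.0 - 1.0 / x
d2f = lambda x: 1.0 / (x * x)

x_min = newton_minimize(df, d2f, 0.5)

# Errors of the first iterates, starting from x = 0.5: 0.5, 0.25, 0.0625, ...
errors, x = [], 0.5
for _ in range(4):
    errors.append(abs(1.0 - x))
    x -= df(x) / d2f(x)
```

Each error in `errors` is the square of the previous one, which is the quadratic
convergence predicted by the analysis above.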

Let us now examine the Golden Ratio search method, for minimizing a unidimensional
and unimodal function, $f(x)$, in the interval $[x^1, x^4]$. Assume we know the function's
value at four points, the extremes of the interval and two interior points,
$x^1 < x^2 < x^3 < x^4$. From the unimodality hypothesis we know that the point of minimum,
$x^*$, is in one of the sub-intervals, that is,

$$ f(x^2) \leq f(x^3) \Rightarrow x^* \in [x^1, x^3] \ , \quad f(x^2) > f(x^3) \Rightarrow x^* \in [x^2, x^4] \ . $$

Without loss of generality, let us consider the way to divide the interval $[0, 1]$. A ratio
$r$ defines a symmetric division in the form $0 < 1 - r < r < 1$. Dividing the subinterval
$[0, r]$ by the same ratio $r$, we obtain the points $0 < r(1 - r) < r^2 < r$. We want the
points $r^2$ and $1 - r$ to coincide, so that it will only be necessary to evaluate the
function at one additional point, that is, we want $r^2 + r - 1 = 0$. Hence,
$r = (\sqrt{5} - 1)/2$; this is the golden ratio, $r \approx 0.6180340$.

The golden ratio search method is robust, working for any unimodal function, and
using only the function's value at the search points. However, the size of the search
interval decreases only linearly with the number of iterations, shrinking by the constant
factor $r$ at each step.
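
The golden ratio search can be sketched in a few lines. The code below is an illustrative
implementation under the unimodality assumption; the function name, the tolerance and the
test function are hypothetical choices for this example.

```python
import math

R = (math.sqrt(5.0) - 1.0) / 2.0   # golden ratio, about 0.618


def golden_search(f, x1, x4, tol=1e-8):
    """Minimize a unimodal f on [x1, x4], one new evaluation per iteration."""
    x2 = x4 - R * (x4 - x1)
    x3 = x1 + R * (x4 - x1)
    f2, f3 = f(x2), f(x3)
    while x4 - x1 > tol:
        if f2 <= f3:
            # Minimum is in [x1, x3]; the old x2 becomes the new x3.
            x4, x3, f3 = x3, x2, f2
            x2 = x4 - R * (x4 - x1)
            f2 = f(x2)
        else:
            # Minimum is in [x2, x4]; the old x3 becomes the new x2.
            x1, x2, f2 = x2, x3, f3
            x3 = x1 + R * (x4 - x1)
            f3 = f(x3)
    return 0.5 * (x1 + x4)


x_min = golden_search(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
```

Because $r^2 = 1 - r$, one of the two interior points of the reduced interval is always an
interior point of the previous interval, so its function value is reused.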

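Polynomial methods, studied next, try to conciliate the best characteristics of the
methods already presented. Polynomial methods for minimizing a unidimensional function,
$\min f(x + \eta)$, on $\eta \geq 0$, rely on a polynomial, $p(x)$, that locally approximates
$f(x)$, and the subsequent minimization of the adjusted polynomial. The simplest of these
methods is quadratic adjustment. Assume we know at three points, $\eta_1, \eta_2, \eta_3$,
the respective function values, $f_i = f(x + \eta_i)$. Considering the equations for the
interpolating polynomial

$$ q(\eta) = a\eta^2 + b\eta + c \ , \quad q(\eta_i) = f_i \ , $$

we obtain the polynomial coefficients

$$ a = \frac{ f_1(\eta_2 - \eta_3) + f_2(\eta_3 - \eta_1) + f_3(\eta_1 - \eta_2) }{ -(\eta_2 - \eta_1)(\eta_3 - \eta_2)(\eta_3 - \eta_1) } \ , $$

$$ b = \frac{ f_1(\eta_3^2 - \eta_2^2) + f_2(\eta_1^2 - \eta_3^2) + f_3(\eta_2^2 - \eta_1^2) }{ -(\eta_2 - \eta_1)(\eta_3 - \eta_2)(\eta_3 - \eta_1) } \ , $$

$$ c = \frac{ f_1(\eta_2^2\eta_3 - \eta_3^2\eta_2) + f_2(\eta_3^2\eta_1 - \eta_1^2\eta_3) + f_3(\eta_1^2\eta_2 - \eta_2^2\eta_1) }{ -(\eta_2 - \eta_1)(\eta_3 - \eta_2)(\eta_3 - \eta_1) } \ . $$

Equating the first derivative of the interpolating polynomial to zero,
$q'(\eta_4) = 2a\eta_4 + b = 0$, we obtain its point of minimum, $\eta_4 = -b/(2a)$ or,
directly from the function's values,

$$ \eta_4 = \frac{1}{2} \ \frac{ f_1(\eta_3^2 - \eta_2^2) + f_2(\eta_1^2 - \eta_3^2) + f_3(\eta_2^2 - \eta_1^2) }{ f_1(\eta_3 - \eta_2) + f_2(\eta_1 - \eta_3) + f_3(\eta_2 - \eta_1) } \ . $$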

We should try to use the initial points in the "interpolating pattern"
$\eta_1 < \eta_2 < \eta_3$ and $f_1 \geq f_2 \leq f_3$, that is, three points where the
intermediary point has the smallest function value. So doing, we know that the minimum of
the interpolating polynomial is inside the initial search interval, that is,
$\eta_4 \in [\eta_1, \eta_3]$. In this situation we are interpolating and not extrapolating
the function, favoring the numerical stability of the procedure.

Choosing $\eta_4$ and two more points from the initial three, we have a new set of three
points in the desired interpolating pattern, and are ready to proceed to the next iteration.
Note that, in general, we can not guarantee that $\eta_4$ is the best point in the new set of
three. However, $\eta_4$ will always replace the worst point in the old set. Hence, the sum
$z = f_1 + f_2 + f_3$ is monotonically decreasing. In section D.3.4 we shall see that these
properties assure the global convergence of the quadratic adjustment line search algorithm.
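
A single step of quadratic adjustment can be sketched as follows. The function name and
the two test functions are hypothetical; the formula is the minimizer of the parabola
interpolating the three given points, $\eta_4 = -b/(2a)$, written directly in terms of the
function values.

```python
import math


def quadratic_step(e1, e2, e3, f1, f2, f3):
    """Minimizer of the parabola through (e1, f1), (e2, f2), (e3, f3)."""
    num = f1 * (e3**2 - e2**2) + f2 * (e1**2 - e3**2) + f3 * (e2**2 - e1**2)
    den = f1 * (e3 - e2) + f2 * (e1 - e3) + f3 * (e2 - e1)
    return 0.5 * num / den


# For a quadratic function the step is exact: f(eta) = (eta - 1)^2 + 3.
f = lambda eta: (eta - 1.0) ** 2 + 3.0
eta4_quadratic = quadratic_step(0.0, 0.5, 2.0, f(0.0), f(0.5), f(2.0))

# For a non-quadratic function in the interpolating pattern f1 >= f2 <= f3,
# the step stays inside the search interval [eta1, eta3].
h = lambda eta: math.exp(eta) - 2.0 * eta
eta4_general = quadratic_step(0.0, 0.5, 1.5, h(0.0), h(0.5), h(1.5))
```

In an iterative line search, `eta4_general` would replace the worst of the three points
and the step would be repeated on the new triple.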

Let us now consider the errors relative to the minimum argument, $e_i = x^* - x_i$. We can
write $e_4 = g(e_1, e_2, e_3)$, where the function $g$ is a second order polynomial. This is
because $\eta_4$ is obtained by a quadratic adjustment, that is also symmetric in its
arguments, since the order of the first three points is irrelevant. Moreover, it is not hard
to check that $e_4$ is zero if two of the three initial errors are zero. Hence, close to the
minimum, $x^*$, we have the following approximation for the fourth error:

$$ e_4 = C (e_1 e_2 + e_1 e_3 + e_2 e_3) \ . $$

Assuming that the process is converging, the dominant term gives, for the $k$-th error,
approximately $e_{k+4} = C e_{k+1} e_{k+2}$. Taking $l_k = \log(C e_k)$, we have
$l_{k+3} = l_{k+1} + l_k$, with characteristic equation $\lambda^3 - \lambda - 1 = 0$. The
largest root of this equation is $\lambda \approx 1.3$. This is the order of convergence of
this method, as defined next.

We say that a sequence of real numbers $r^k \to r^*$ converges at least in order $p > 0$ if

$$ 0 \leq \lim_{k \to \infty} \frac{ | r^{k+1} - r^* | }{ | r^k - r^* |^p } = \beta < \infty \ . $$

The sequence's order of convergence is the supremum of constants $p > 0$ in such conditions.
If $p = 1$ and $\beta < 1$, we say that the sequence has linear convergence with rate
$\beta$. If $\beta = 0$, we say that the sequence has super-linear convergence.
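
The definition can be explored numerically by computing the ratios
$|r^{k+1}| / |r^k|^p$ for sequences converging to $r^* = 0$. The three sequences used
below, $1/k$, $(1/k)^k$ and $a^{(2^k)}$ with $a = 0.5$, are illustrative choices, and the
helper name `ratios` is an assumption of this sketch.

```python
def ratios(r, p, ks):
    """Ratios |r(k+1)| / |r(k)|^p for a sequence r(k) converging to zero."""
    return [abs(r(k + 1)) / abs(r(k)) ** p for k in ks]


harmonic = lambda k: 1.0 / k            # order 1, ratio tends to 1
fast = lambda k: (1.0 / k) ** k         # order 1, ratio tends to 0
doubling = lambda k: 0.5 ** (2 ** k)    # order 2

harmonic_ratios = ratios(harmonic, 1, [10, 100, 1000])
fast_ratios = ratios(fast, 1, [5, 10, 20])
doubling_ratios = ratios(doubling, 2, range(1, 8))
```

The first list approaches 1 from below (no linear rate $\beta < 1$), the second approaches
0 (super-linear convergence), and the third is constant and finite for $p = 2$.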

For example, for $c > 1$ and $0 < a < 1$, $c$ is the order of convergence of the sequence
$a^{(c^k)}$. We can also see that $1/k$ converges in order 1, although it is not linearly
convergent, because $r^{k+1}/r^k \to 1$. Finally, $(1/k)^k$ converges in order 1, because
for any $p > 1$, $r^{k+1}/(r^k)^p \to \infty$. However, this convergence is super-linear,
because $r^{k+1}/r^k \to 0$.

D.3.3 The Gradient ParTan Algorithm

In this section we present the method of Parallel Tangents, ParTan, developed by Shah,
Buehler and Kempthorne (1964) for solving the problem of minimizing an unconstrained
convex function. We present a particular case of the General ParTan algorithm, the
Gradient ParTan, following the presentation in Luenberger (1983).

The ParTan algorithm was developed to solve exactly, after $n$ steps, a general quadratic
function $f(x) = x'Ax + b'x + c$. If $A$ is a real, symmetric and full rank matrix, it is
possible to find the eigenvalue decomposition $V'AV = D = \mathrm{diag}(d)$, see section F.2.
If we had the eigenvector matrix, $V$, we could consider the coordinate transformation
$y = V'x$, $x = Vy$, $f(y) = y'V'AVy + b'Vy + c = y'Dy + e'y + c$. The coordinate
transformation given by (the orthogonal) matrix $V$ can be interpreted as a decoupling
operator, see Chap. 3, for it transforms an $n$-vector optimization problem into $n$
independent scalar optimization problems, $y_i \in \arg\min\ d_i (y_i)^2 + e_i y_i + c$.
However, finding the eigenvalue decomposition of $A$ is even harder than solving the
original optimization problem. A set of vectors (or directions), $w^k$, is $A$-conjugate
iff, for $k \neq j$, $(w^k)'Aw^j = 0$. A (non-orthogonal) matrix of $n$ $A$-conjugate
vectors, $W = [w^1 \ldots w^n]$, provides an alternative, and much cheaper, decoupling
operator for the quadratic optimization problem. The ParTan algorithm finds, on the fly, a
set of $n$ $A$-conjugate vectors $w^k$.

To simplify the notation we assume, without loss of generality, a quadratic function
that is centered at the origin, $f(x) = x'Ax$. Therefore, $\mathrm{grad} f(x) = 2Ax$, so that
$y'Ax = (1/2)\, y'\mathrm{grad} f(x)$, and vectors $x$ and $y$ are $A$-conjugate iff $y$ is
orthogonal to $\mathrm{grad} f(x)$. The ParTan algorithm is defined as follows, progressing
through points $x^0, x^1, y^1, x^2, \ldots, x^{k-1}, y^{k-1}, x^k$, see Figure D.2 (left).
The algorithm is initialized by choosing an arbitrary starting point, $x^0$, by an initial
Cauchy step to find $y^0$, and by taking $x^1 = y^0$.

N-Dimensional (Gradient) ParTan Algorithm:

- Cauchy step: For $k = 0, 1, \ldots n$, find $y^k = x^k + \alpha_k g^k$ in an exact line
search along the $k$-th steepest descent direction, $g^k = -\mathrm{grad} f(x^k)$.

- Acceleration step: For $k = 1, \ldots n - 1$, find $x^{k+1} = y^k + \beta_k (y^k - x^{k-1})$
in an exact line search along the $k$-th acceleration direction, $(y^k - x^{k-1})$.

In order to prove the correctness of the ParTan algorithm, we will prove, by induction,
two statements:

(1) The directions $w^k = (x^{k+1} - x^k)$ are $A$-conjugate.

(2) Although the ParTan never performs the conjugate direction line search,
$x^{k+1} = x^k + \gamma_k w^k$, this is what implicitly happens, that is, the point
$x^{k+1}$, actually found at the acceleration step, would also solve the (hypothetical)
conjugate direction line search.

The basis for the induction, $k = 1$, is trivially true. Let us assume the statements are
true up to $k - 1$, and prove the induction step for the index $k$, see Figure D.2 (right).

Figure D.2: The Gradient ParTan Algorithm.



By the induction hypothesis, $x^k$ is the minimum of $f(x)$ on the $k$-dimensional
hyperplane through $x^0$ spanned by all previous conjugate directions, $w^j$, $j < k$. Hence,
$g^k = -\mathrm{grad} f(x^k)$ is orthogonal to all $w^j$, $j < k$. All previous search
directions lie in the same $k$-hyperplane, hence, $g^k$ is also orthogonal to them. In
particular, $g^k$ is orthogonal to $g^{k-1} = -\mathrm{grad} f(x^{k-1})$. Also, from the exact
Cauchy step from $x^k$ to $y^k$, we know that $g^k$ must be orthogonal to
$\mathrm{grad} f(y^k)$. Since $\mathrm{grad} f(x)$ is a linear function, it must be orthogonal
to $g^k$ at any point in the line search $x^{k+1} = y^k + \beta_k (y^k - x^{k-1})$. Since
this line search is exact, $\mathrm{grad} f(x^{k+1})$ is orthogonal to $(y^k - x^{k-1})$.
Hence $\mathrm{grad} f(x^{k+1})$ is orthogonal to any linear combination of $g^k$ and
$(y^k - x^{k-1})$, including $w^k$. For all other products $(w^j)'Aw^k$, $j < k - 1$, we only
have to write $w^k$ as a linear combination of $g^k$ and $w^{k-1}$ to see that they vanish.
This is enough to conclude the induction step of statements (1) and (2). QED.

Since a full rank matrix $A$ can have at most $n$ simultaneous $A$-conjugate directions,
the Gradient ParTan must find the optimal solution of a quadratic function in at most
$n$ steps. This fact can be used to show that, if the quadratic model of the objective
function is good, the ParTan algorithm converges quadratically. Nevertheless, even if the
quadratic model for the objective function is poor, the Cauchy (steepest descent) steps can
make good progress. This explains the robustness of the Gradient ParTan as an optimization
algorithm, even if it starts far away from the optimal solution.
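
The finite termination property can be observed on a small quadratic problem. The sketch
below assumes a quadratic function $f(x) = x'Ax + b'x$ with $A$ symmetric and positive
definite, for which the exact line search has a closed form; the matrix, the vector `b`,
the starting point and all function names are hypothetical choices for this illustration.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def partan(A, b, x0):
    """Gradient ParTan for f(x) = x'Ax + b'x; returns the point x^n."""
    n = len(x0)

    def grad(x):
        return [2.0 * ax + bi for ax, bi in zip(matvec(A, x), b)]

    def line_min(p, d):
        # Exact minimizer of f(p + t d):  t = -grad(p)'d / (2 d'Ad).
        curvature = 2.0 * dot(d, matvec(A, d))
        if curvature == 0.0:          # null direction, nothing to do
            return p
        t = -dot(grad(p), d) / curvature
        return [pi + t * di for pi, di in zip(p, d)]

    def cauchy(x):
        return line_min(x, [-g for g in grad(x)])

    x_prev, x = x0, cauchy(x0)        # x^1 = y^0
    for _ in range(1, n):             # k = 1, ..., n-1
        y = cauchy(x)                 # Cauchy step from x^k to y^k
        # Acceleration step along (y^k - x^{k-1}), giving x^{k+1}.
        x_prev, x = x, line_min(y, [yi - xi for yi, xi in zip(y, x_prev)])
    return x, grad(x)


A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [-1.0, -2.0, -3.0]
x_opt, g_opt = partan(A, b, [0.0, 0.0, 0.0])
```

For this three-dimensional example the gradient at `x_opt` vanishes, up to rounding
errors, after one initial Cauchy step and two Cauchy plus acceleration cycles.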

The ParTan needs two line searches in order to obtain each conjugate direction. Far
away from the optimal solution a Cauchy method would use only one line search. Close
to the optimal solution alternative versions of the ParTan algorithm, known as Conjugate
Gradient algorithms, achieve quadratic convergence using only one line search per
dimension. Nevertheless, in order to use these algorithms one has to devise a monitoring
system that keeps track of how well the quadratic model is doing, and use it to decide when
to make the transition from the Cauchy to the Conjugate Gradient algorithm. Hence, the
Partanization of search directions provides a simple mechanism to upgrade an algorithm
based on Cauchy (steepest descent) line search steps, accelerating it to achieve quadratic
convergence, while keeping the robustness that is so characteristic of Cauchy methods.

D.3.4 Global Convergence

In this section we give some conditions that assure global convergence for a NLP
algorithm. We follow the ideas of Zangwill (1964); similar analyses are presented in
Luenberger (1984) and Minoux and Vajda (1986).

We define an Algorithm as an iterative process generating a sequence of points,
$x^0, x^1, x^2, \ldots$, that obey a recursion equation of the form $x^{k+1} \in A_k(x^k)$,
where the point-to-set map $A_k(x^k)$ defines the possible successors of $x^k$ in the
sequence.

The idea of using a point-to-set map, instead of an ordinary function or point-to-point
map, allows us to study in a unified way a whole class of algorithms, including alternative
implementations of several details, approximate or inexact computations, randomized
steps, etc. The basic property we look for on the maps defining an algorithm is closure,
defined as follows.

A point-to-set map, $A$, from space $X$ to space $Y$, is closed at $x$ if the following
condition holds: If a sequence $x^k$ converges to $x \in X$, and the sequence $y^k$ converges
to $y \in Y$, where $y^k \in A(x^k)$, then also the limit $y$ is in the image $A(x)$, that is,

$$ x^k \to x \ , \ y^k \to y \ , \ y^k \in A(x^k) \ \Rightarrow\ y \in A(x) \ . $$

The map is closed in $C \subseteq X$ if it is closed at any point of $C$. Note that if we
replace, in the definition of closed map, the inclusion relation by the equality relation,
we get the definition of continuity for point-to-point functions. Therefore, the closure
property is a generalization of continuity. Indeed, a continuous function is closed,
although the contrary is not necessarily true.

The basic idea of Zangwill's global convergence theorem is to find some characteristic 
that is continuously "improved" at each iteration of the algorithm. This characteristic is 
represented by the concept of descendence function. 

Let A be an algorithm in X for solving the problem P, and let C X be the solution 
set for P. A function Z{x) e is a descendence function for [X, A, S) if the composition of 
Z and A is always decreasing outside the solution set, and does not increase inside the 
solution set, that is, 

X ^ S Ay e A{x) Z{y) < Z{x) and x e S Ay e A{x) =^ Z{y) < Z{x) . 

In optimization problems, sometimes the objective function itself is a good descendence
function. Other times, more complex descendence functions have to be used, for example,
the objective function with auxiliary terms, like penalties for constraint violations.
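
As a minimal sketch of these definitions, the following Python fragment takes the objective function itself as descendence function. The quadratic objective, the fixed step size and the names `algorithm_map`, `Z` and `run` are illustrative assumptions, not part of the original text; here the map is point-to-point, which is a special case of a point-to-set map.

```python
def algorithm_map(x, step=0.25):
    """One iteration A(x): a fixed-step gradient step on f(x) = (x - 3)^2.

    Illustrative assumption: the solution set is S = {3}.
    """
    grad = 2.0 * (x - 3.0)
    return x - step * grad


def Z(x):
    """Descendence function: here simply the objective function."""
    return (x - 3.0) ** 2


def run(x0, iterations=40):
    """Generate the sequence x^0, x^1, ..., x^iterations."""
    xs = [x0]
    for _ in range(iterations):
        xs.append(algorithm_map(xs[-1]))
    return xs


xs = run(10.0)
# Z strictly decreases along the sequence while the iterates are outside S
decreasing = all(Z(b) < Z(a) for a, b in zip(xs, xs[1:]))
```

With this step size each iteration halves the distance to the solution, so `Z` is reduced by a factor of four per step.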

Before we state Zangwill's theorem, let us review two basic concepts of set topology:
An accumulation point of a sequence is a limit point for one of its sub-sequences. A set
is compact iff any (infinite) sequence in the set has an accumulation point inside the set.
In $R^n$, a set is compact iff it is closed and bounded.

Zangwill's Global Convergence Theorem:

Let $Z$ be a descendence function for the algorithm $A$ defined in $X$ with solution set
$S$, and let $x^0, x^1, x^2, \ldots$ be a sequence generated by this algorithm such that:

A) The map $A$ is closed at any point outside $S$,

B) All points in the sequence remain inside a compact set $C \subseteq X$, and

C) $Z$ is continuous.

Then, any accumulation point of the sequence is in the solution set. 

Proof: From the compactness of $C$, a sequence generated by the algorithm has a limit point,
$x \in C \subseteq X$, for a subsequence, $x^{s(k)}$. From the continuity of $Z$ in $X$, the limit value of $Z$ in
the subsequence coincides with the value of $Z$ at the limit point, that is, $Z(x^{s(k)}) \to Z(x)$.
But the complete sequence, $Z(x^k)$, is monotonically decreasing, hence, if $s(k) < j <
s(k+1)$ then $Z(x^{s(k)}) \geq Z(x^j) \geq Z(x^{s(k+1)})$, and the value of $Z$ in the complete sequence
also converges to the value of $Z$ at the accumulation point, that is, $Z(x^k) \to Z(x)$.

Let us now imagine, for a proof by contradiction, that $x$ is not a solution, so that
$Z(A(x)) < Z(x)$. Let us consider the sub-sequence of the successors of the points in the
first sub-sequence, $x^{s(k)+1}$. This second sub-sequence, again by compactness, also has an
accumulation point, $x'$. But from the result in the last paragraph, the value of the
descendence function in both sub-sequences converges to the limit value of the whole
sequence, that is, $\lim Z(x^{s(k)+1}) = \lim Z(x^k) = \lim Z(x^{s(k)})$, so that $Z(x') = Z(x)$. Since
$A$ is closed at $x$, we also have $x' \in A(x)$, and hence $Z(x') < Z(x)$, a contradiction. So we
have proved the impossibility of $x$ not being a solution.

Several algorithms are formulated as a composition of several steps. Hence, the map
describing the whole algorithm is the composition of several maps, one for each step. A
typical example would be a step for choosing a search direction, followed by a step for a
line search. The following lemmas are useful in the construction of such composite maps.

First Composition Lemma: Let $A$ from $X$ to $Y$, and $B$ from $Y$ to $Z$, be point-to-set
maps, $A$ closed in $x \in X$, $B$ closed in $A(x)$. If, for any sequence $x^k$ converging to $x$,
$y^k \in A(x^k)$ has an accumulation point $y$, then the composed map $B \circ A$ is closed in $x$.

Second Composition Lemma: Let $A$ from $X$ to $Y$, and $B$ from $Y$ to $Z$, be point-to-set
maps, $A$ closed in $x \in X$, $B$ closed in $A(x)$. If $Y$ is compact, then the composed map,
$B \circ A$, is closed in $x$.

Third Composition Lemma: Let $A$ be a point-to-point map from $X$ to $Y$, and $B$ a
point-to-set map from $Y$ to $Z$. If $A$ is continuous in $x$, and $B$ is closed in $A(x)$, then the
composed map $B \circ A$ is closed in $x$.



D.4 Variational Principles 

The variational problem asks for the function $q(t)$ that minimizes a global functional
(function of a function), $J(q)$, with fixed boundary conditions, $q(a)$ and $q(b)$, as shown in
Figure D.3. Its general form is given by a local functional, $F(t, q, q')$, and an integral or
global functional,

$$ J(q) = \int_a^b F(t, q, q') \, dt , $$

where the prime indicates, as usual, the simple derivative with respect to $t$, that is,
$q' = dq/dt$.





Figure D.3: Variational problem, $q(x)$, $\eta(x)$, $q(x) + \eta(x)$



Euler-Lagrange Equation 

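Consider a 'variation' of $q(t)$ given by another curve, $\eta(t)$, satisfying the fixed boundary
conditions, $\eta(a) = \eta(b) = 0$,

$$ q = q(\epsilon, t) = q(t) + \epsilon \eta(t) \ \ \mbox{and} \ \ J(\epsilon) = \int_a^b F(t, q(\epsilon, t), q'(\epsilon, t)) \, dt . $$

A minimizing $q(t)$ must be stationary, that is,

$$ \frac{dJ}{d\epsilon} = \frac{d}{d\epsilon} \int_a^b F(t, q(\epsilon, t), q'(\epsilon, t)) \, dt = 0 . $$

Since the boundary conditions are fixed, the differential operator affects only the integrand,
hence

$$ \frac{dJ}{d\epsilon} = \int_a^b \left( \frac{\partial F}{\partial q} \frac{\partial q}{\partial \epsilon} + \frac{\partial F}{\partial q'} \frac{\partial q'}{\partial \epsilon} \right) dt . $$

From the definition of $q(\epsilon, t)$ we have

$$ \frac{\partial q}{\partial \epsilon} = \eta(t) , \ \ \frac{\partial q'}{\partial \epsilon} = \eta'(t) , \ \ \mbox{hence} \ \ \frac{dJ}{d\epsilon} = \int_a^b \left( \frac{\partial F}{\partial q} \eta(t) + \frac{\partial F}{\partial q'} \eta'(t) \right) dt . $$

Integrating the second term by parts, we get

$$ \int_a^b \frac{\partial F}{\partial q'} \eta'(t) \, dt = \left[ \frac{\partial F}{\partial q'} \eta(t) \right]_a^b - \int_a^b \frac{d}{dt} \left( \frac{\partial F}{\partial q'} \right) \eta(t) \, dt , $$

where the first term vanishes, since the extreme points, $\eta(a) = \eta(b) = 0$, are fixed. Hence

$$ \frac{dJ}{d\epsilon} = \int_a^b \left( \frac{\partial F}{\partial q} - \frac{d}{dt} \frac{\partial F}{\partial q'} \right) \eta(t) \, dt . $$

Since $\eta(t)$ is arbitrary and the integral must be zero, the parenthesis in the integrand
must be zero. This is the Euler-Lagrange equation:

$$ \frac{\partial F}{\partial q} - \frac{d}{dt} \frac{\partial F}{\partial q'} = 0 . $$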

Noether Theorems 

Noether theorems establish very general conditions under which the existence of a symmetry
in the system, described by the invariance under the action of a continuous group,
implies the existence of a quantity that remains constant in the system's evolution, that
is, a conservation law, see for example Byron and Fuller (1969, V-I, Sec. 2.7).

For example, consider a functional $F(t, q, q')$ that does not depend explicitly on $q$. This
situation reveals a symmetry: The system is invariant by a translation on the coordinate
$q$. From the Euler-Lagrange equation, it follows that the quantity $p = \partial F / \partial q'$ is conserved.
In the language of classical mechanics, $q$ would be called a "cyclic coordinate", while $p$
would be called a "generalized momentum".

Let us consider the lifeguard's problem from section 5.5. Using the variable $t$ instead
of $x$, and $q$ instead of $y$, the length of an infinitesimal arc is given by $ds^2 = dt^2 + dq^2$, and
we can build the total travel time using the functional

$$ F(t, q, q') = \nu(t) \sqrt{1 + q'^2} . $$

Since the local functional is not a function of $q$, the Euler-Lagrange equation reduces to
$\partial F / \partial q' = K$, where $K$ is a constant. Hence, the lifeguard's problem solution is

$$ \nu(t) \frac{q'}{\sqrt{1 + q'^2}} = K . $$

If the resistance index $\nu(t)$ is also independent of $t$, $q'$ must be a constant, so that $q$ is a
straight line, as we have guessed in our very informal solution. In general, writing
$q' = \tan(\theta)$, the solution to the lifeguard's problem is given by

$$ \nu(t) \frac{\tan(\theta)}{\sqrt{1 + \tan(\theta)^2}} = \nu(t) \sin(\theta) = K . $$



Appendix E 

Entropy and Asymptotics 



"...we can identify that quantity which we commonly designate as 
(thermodynamic) entropy with the probability of the actual state." 

Ludwig Boltzmann (1844 - 1906).
Wärmetheorie und der Wahrscheinlichkeitsrechnung, 1877.

The origins of the entropy concept lie in the fields of Thermodynamics and Statistical
Physics, but its applications have extended far and wide to many other phenomena,
physical or not. The entropy of a probability distribution, $H(p(x))$, is a measure of
uncertainty (or impurity, confusion) in a system whose states, $x \in X$, have $p(x)$ as
probability distribution. We follow closely the presentation in the following references.
For the basic concepts: Csiszar (1974), Dugdale (1996), Kinchine (1957) and Renyi (1961,
1970). For MaxEnt characterizations: Gokhale (1975) and Kapur (1989). For MaxEnt
optimization: Censor and Zenios (1994, 1997), Elfving (1980), Fang et al. (1997) and
Iusem and Pierro (1987). For posterior asymptotic convergence: Gelman (1995).

For a detailed analysis of the connection between MaxEnt optimization and Bayesian 
statistics' formalisms, that is, for a deeper view of the relation between MaxEnt and 
Bayes' rule updates, see Caticha and Giffin (2007) and Caticha (2007). 

E.1 Convexity

We first introduce the concept of convexity, that is going to be important throughout this 
chapter. 

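Definition: A region $S \subseteq R^n$ is Convex iff, for any two points, $x^1, x^2 \in S$, and weights
$0 \leq l_1, l_2 \leq 1 \mid l_1 + l_2 = 1$, the convex combination of these two points remains in $S$, i.e.
$l_1 x^1 + l_2 x^2 \in S$.

Theorem Finite Convex Combination: A region $S \subseteq R^n$ is Convex iff any (finite)
convex combination of its points remains in the region, i.e., $\forall \, 0 \leq l \leq 1 \mid 1'l = 1$,
$X = [x^1, x^2, \ldots x^m]$, $x^j \in S$,

$$ X l = \begin{bmatrix} x^1_1 & x^2_1 & \ldots & x^m_1 \\ x^1_2 & x^2_2 & \ldots & x^m_2 \\ \vdots & \vdots & \ddots & \vdots \\ x^1_n & x^2_n & \ldots & x^m_n \end{bmatrix} \begin{bmatrix} l_1 \\ l_2 \\ \vdots \\ l_m \end{bmatrix} \in S . $$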

Proof: By induction in the number of points, m. 

Definition: The Epigraph of the function $\varphi : R^n \to R$ is the region of $R^{n+1}$ "above the
graph" of $\varphi$, i.e.

$$ \mathrm{Epi}(\varphi) = \left\{ x \in R^{n+1} \mid x_{n+1} \geq \varphi([x_1, x_2, \ldots, x_n]') \right\} . $$

Definition: A function $\varphi$ is convex iff its epigraph is convex. A function $\varphi$ is concave
iff $-\varphi$ is convex.

Theorem: A twice differentiable function, $\varphi : R \to R$, with non negative second derivative
is convex.

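Proof: Consider $x^0 = l_1 x^1 + l_2 x^2$, and the Taylor expansion around $x^0$,

$$ \varphi(x) = \varphi(x^0) + \varphi'(x^0)(x - x^0) + (1/2) \varphi''(x^*)(x - x^0)^2 , $$

where $x^*$ is an appropriate intermediate point. If $\varphi''(x^*) \geq 0$ the last term is non-negative.
Now, making $x = x^1$ and $x = x^2$, and noting that $x^1 - x^0 = l_2 (x^1 - x^2)$ and
$x^2 - x^0 = l_1 (x^2 - x^1)$, we have, respectively, that $\varphi(x^1) \geq \varphi(x^0) + \varphi'(x^0) l_2 (x^1 - x^2)$
and $\varphi(x^2) \geq \varphi(x^0) + \varphi'(x^0) l_1 (x^2 - x^1)$. Multiplying the first inequality by $l_1$, the second by
$l_2$, and adding them, we obtain the desired result, $l_1 \varphi(x^1) + l_2 \varphi(x^2) \geq \varphi(x^0)$.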

Theorem Jensen Inequality: If $\varphi$ is a convex function,

$$ E(\varphi(x)) \geq \varphi(E(x)) . $$



For discrete distributions the Jensen inequality is a special case of the finite convex 
combination theorem. Arguments of Analysis allow us to extend the result to continuous 
distributions. 
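
For a discrete distribution the inequality can be checked numerically. The sketch below is only an illustration; the helper names `expectation` and `jensen_gap`, and the sample values, are assumptions introduced here and do not come from the text.

```python
import math


def expectation(values, probs):
    """E(x) for a discrete distribution given by parallel lists."""
    return sum(p * v for p, v in zip(probs, values))


def jensen_gap(phi, values, probs):
    """E(phi(x)) - phi(E(x)); non-negative whenever phi is convex."""
    mean_of_phi = expectation([phi(v) for v in values], probs)
    return mean_of_phi - phi(expectation(values, probs))


probs = [0.2, 0.5, 0.3]
# For phi(x) = x^2 the gap is exactly the variance of x
gap_square = jensen_gap(lambda x: x * x, [1.0, 2.0, 3.0], probs)
# phi(t) = t ln(t) is the convex function used later for Shannon's inequality
gap_tlogt = jensen_gap(lambda t: t * math.log(t), [0.5, 1.0, 2.0], probs)
```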



E.2 Boltzmann-Gibbs-Shannon Entropy 

If $H(p(x))$ is to be a measure of uncertainty, it is reasonable that it should satisfy the
following list of requirements. For the sake of simplicity, we present the theory for finite
spaces.

1) If the system has $n$ possible states, $x_1, \ldots x_n$, the entropy of the system with a given
distribution, $p_i = p(x_i)$, is a function

$$ H = H_n(p_1, \ldots, p_n) . $$

2) $H$ is a continuous function.

3) $H$ is a function symmetric in its arguments.

4) The entropy is unchanged if an impossible state is added to the system, i.e.,

$$ H_n(p_1, \ldots p_n) = H_{n+1}(p_1, \ldots p_n, 0) . $$

5) The system's entropy is minimal and null when the system is fully determined, i.e.,

$$ H_n(0, \ldots, 0, 1, 0, \ldots 0) = 0 . $$

6) The system's entropy is maximal when all states are equally probable, i.e.,

$$ \frac{1}{n} 1 = \arg\max H_n . $$

7) A system's maximal entropy increases with the number of states, i.e.

$$ H_{n+1}\left( \frac{1}{n+1} 1 \right) > H_n\left( \frac{1}{n} 1 \right) . $$

8) Entropy is an extensive quantity, i.e., given two independent systems, with distributions
$p$ and $q$, the entropy of the composite system is additive, i.e.,

$$ H_{nm}(r) = H_n(p) + H_m(q) , \ \ r_{i,j} = p_i \, q_j . $$

The Boltzmann-Gibbs-Shannon measure of entropy,

$$ H_n(p) = -I_n(p) = -\sum_{i=1}^n p_i \log(p_i) = -\mathrm{E}_i \log(p_i) , \ \ 0 \log(0) \equiv 0 , $$

satisfies requirements (1) to (8), and is the most usual measure of entropy. In Physics it
is usual to take the logarithm in Neperian (natural) base, while in Computer Science it is
usual to take base 2 and in Engineering it is usual to take base 10. The opposite of the entropy,
$I(p) = -H(p)$, the Negentropy, is a measure of Information available about the system.
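
The following sketch computes this measure for a finite distribution and illustrates some of the requirements on small examples. The function name `entropy` and its `base` parameter are illustrative assumptions; the convention 0 log(0) = 0 is implemented by skipping null probabilities.

```python
import math


def entropy(p, base=math.e):
    """Boltzmann-Gibbs-Shannon entropy of a finite distribution p.

    Null probabilities are skipped, implementing the convention 0 log(0) = 0.
    """
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0.0)


# Requirement 5: a fully determined system has null entropy
h_determined = entropy([0.0, 1.0, 0.0])

# Requirement 6: the uniform distribution maximizes the entropy
h_uniform = entropy([0.25] * 4, base=2)
h_skewed = entropy([0.7, 0.1, 0.1, 0.1], base=2)

# Requirement 8: additivity for independent systems, r_ij = p_i q_j
p, q = [0.2, 0.8], [0.5, 0.3, 0.2]
r = [pi * qj for pi in p for qj in q]
additivity_error = abs(entropy(r) - entropy(p) - entropy(q))
```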

For the Boltzmann-Gibbs-Shannon entropy we can extend requirement 8, and compute
the composite Negentropy even without independence:

$$ I_{nm}(r) = \sum_{i=1,j=1}^{n,m} r_{i,j} \log(r_{i,j}) = \sum_{i=1,j=1}^{n,m} p_i \Pr(j \mid i) \log\left( p_i \Pr(j \mid i) \right) $$

$$ = \sum_{i=1}^n p_i \log(p_i) \sum_{j=1}^m \Pr(j \mid i) + \sum_{i=1}^n p_i \sum_{j=1}^m \Pr(j \mid i) \log\left( \Pr(j \mid i) \right) $$

$$ = I_n(p) + \sum_{i=1}^n p_i \, I_m(q^i) \ \ \mbox{where} \ \ q^i_j = \Pr(j \mid i) . $$

If we add this last identity as item number 9 in the list of requirements, we have
a characterization of Boltzmann-Gibbs-Shannon entropy, see Kinchine (1957) and Renyi
(1961, 1970).

Like many important concepts, this measure of entropy was discovered and re-discovered
several times in different contexts, and sometimes the uniqueness and identity of the
concept was not immediately recognized. A well known anecdote refers to the answer given
by von Neumann, after Shannon asked him what to call a "newly" discovered concept in
Information Theory. As reported by Shannon in Tribus and McIrvine (1971, p. 180):

"My greatest concern was what to call it. I thought of calling it information, but the
word was overly used, so I decided to call it uncertainty. When I discussed it with John
von Neumann, he had a better idea. Von Neumann told me, You should call it entropy,
for two reasons. In the first place your uncertainty function has been used in statistical
mechanics under that name, so it already has a name. In the second place, and more
important, nobody knows what entropy really is, so in a debate you will always have the
advantage."

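E.3 Csiszar's $\varphi$-divergence

In order to check that requirement (6) is satisfied, we can use (with $q \propto 1$) the following
lemma:

Lemma: Shannon Inequality

If $p$ and $q$ are two distributions over a system with $n$ possible states, and $q_i \neq 0$, then the
Information Divergence of $p$ relative to $q$, $I_n(p, q)$, is positive, except if $p = q$, when it is
null,

$$ I_n(p, q) = \sum_{i=1}^n p_i \log\left( \frac{p_i}{q_i} \right) \geq 0 . $$

Proof: By Jensen inequality, if $\varphi$ is a convex function,

$$ E(\varphi(x)) \geq \varphi(E(x)) . $$

Taking $\varphi(t) = t \ln(t)$ and $t_i = p_i / q_i$, with expectations taken relative to $q$,

$$ E_q(t) = \sum_{i=1}^n q_i \frac{p_i}{q_i} = 1 , \ \ I_n(p, q) = \sum_{i=1}^n q_i \, t_i \ln(t_i) = E_q(\varphi(t)) \geq \varphi(E_q(t)) = 1 \ln(1) = 0 . $$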

Shannon's inequality motivates the use of the Information Divergence as a measure
of (non symmetric) "distance" between distributions. In Statistics this measure is known
as the Kullback-Leibler distance. The denominations Directed Divergence or Cross
Information are used in Engineering. The proof of Shannon inequality motivates the following
generalization of divergence:

Definition: Csiszar's $\varphi$-divergence.
Given a convex function $\varphi$,

$$ d_\varphi(p, q) = \sum_{i=1}^n q_i \, \varphi\left( \frac{p_i}{q_i} \right) . $$

For example, we can define the quadratic and the absolute divergence as

$$ \chi^2(p, q) = \sum_{i=1}^n \frac{(p_i - q_i)^2}{q_i} , \ \ \mbox{for} \ \ \varphi(t) = (t - 1)^2 ; $$

$$ Ab(p, q) = \sum_{i=1}^n |p_i - q_i| , \ \ \mbox{for} \ \ \varphi(t) = |t - 1| . $$
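
A minimal sketch of the general definition, assuming strictly positive $q_i$: the function `phi_divergence` and the three generator functions below are illustrative names introduced here, not notation from the text.

```python
import math


def phi_divergence(phi, p, q):
    """Csiszar divergence: sum_i q_i * phi(p_i / q_i), assuming all q_i > 0."""
    return sum(qi * phi(pi / qi) for pi, qi in zip(p, q))


def phi_information(t):
    """phi(t) = t ln(t), with the convention 0 ln(0) = 0."""
    return t * math.log(t) if t > 0.0 else 0.0


def phi_quadratic(t):
    return (t - 1.0) ** 2


def phi_absolute(t):
    return abs(t - 1.0)


p = [0.5, 0.5]
q = [0.25, 0.75]
information = phi_divergence(phi_information, p, q)  # information divergence I(p, q)
quadratic = phi_divergence(phi_quadratic, p, q)      # sum (p_i - q_i)^2 / q_i
absolute = phi_divergence(phi_absolute, p, q)        # sum |p_i - q_i|
```

Each of the three divergences is null when `p` equals `q`, and the information divergence is not symmetric in its arguments.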



E.4 Maximum Entropy under Constraints 

Given a prior distribution, $q$, we would like to find a vector $p$ that minimizes the
Information Divergence $I_n(p, q)$, where $p$ is under the constraint of being a probability
distribution, and maybe also under additional constraints over the expectation of functions
taking values on the system's states, that is, we want

$$ p^* \in \arg\min I_n(p, q) , \ \ p \geq 0 \mid 1'p = 1 \ \mbox{and} \ Ap = b , \ \ A \ (m - 1) \times n . $$

$p^*$ is the Minimum Information Divergence distribution, relative to $q$, given the
constraints $\{A, b\}$. We can write the probability normalization constraint as a generic linear
constraint, including $1'$ and $1$ as the $m$-th (or 0-th) rows of matrix $A$ and vector $b$. So
doing, we do not need to keep any distinction between the normalization and the other
constraints. In this chapter, the operators $\odot$ and $\oslash$ indicate the point (element) wise product
and division between matrices of same dimension.

The Lagrangean function of this optimization problem, and its derivatives are:

$$ L(p, w) = p' \log(p \oslash q) + w'(b - Ap) , $$

$$ \frac{\partial L}{\partial p_i} = \log(p_i / q_i) + 1 - w'A^i , \ \ \frac{\partial L}{\partial w_k} = b_k - A_k p . $$

Equating the $n + m$ derivatives to zero, we have a system with $n + m$ unknowns and
equations, giving viability and optimality conditions (VOCs) for the problem:

$$ p_i = q_i \exp\left( w'A^i - 1 \right) \ \ \mbox{or} \ \ p = q \odot \exp\left( (w'A)' - 1 \right) , $$

$$ A_k p = b_k , \ \ p \geq 0 . $$

We can further replace the unknown probabilities, $p_i$, writing the VOCs only on $w$,
the dual variables (Lagrange multipliers),

$$ h_k(w) = A_k \left( q \odot \exp\left( (w'A)' - 1 \right) \right) - b_k = 0 . $$

The last form of the VOCs motivates the use of iterative algorithms of Gauss-Seidel
type, solving the problem by cyclic iteration. In this type of algorithm, one cyclically
"fits" one equation of the system, for the current value of the other variables. For a
detailed analysis of this type of algorithm, see Censor and Zenios (1994, 1997), Elfving
(1980), Garcia et al. (2002) and Iusem and Pierro (1987).

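Bregman Algorithm:

Initialization: Take $t = 0$, $w^t \in R^m$, and

$$ p^t_i = q_i \exp\left( w^{t\prime} A^i - 1 \right) . $$

Iteration step: for $t = 1, 2, \ldots$, Take

$k = (t \bmod m)$ and $\nu \mid \varphi(\nu) = 0$, where

$$ w^{t+1} = \left[ w^t_1, \ldots w^t_{k-1}, \, w^t_k + \nu, \, w^t_{k+1}, \ldots w^t_m \right]' , $$

$$ p^{t+1}_i = q_i \exp\left( w^{t+1\prime} A^i - 1 \right) = p^t_i \exp\left( \nu A^i_k \right) , $$

$$ \varphi(\nu) = A_k \, p^{t+1} - b_k . $$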
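
The iteration above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: rows are numbered from 0, the one-dimensional equation $\varphi(\nu) = 0$ is solved by bracket expansion and bisection (a choice made here, not prescribed by the text), the constraints are assumed to be consistent, and the names `solve_increasing` and `bregman` are hypothetical. The sample data take the uniform prior on six states, the normalization constraint, and an expected state value of 4.0.

```python
import math


def solve_increasing(phi, lo=-1.0, hi=1.0, iters=200):
    """Root of a continuous increasing function, by bracket expansion and bisection.

    Assumes a root exists; otherwise the bracket expansion does not terminate.
    """
    while phi(lo) > 0.0:
        lo *= 2.0
    while phi(hi) < 0.0:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def bregman(A, b, q, sweeps=200):
    """Cyclically fit one constraint A_k p = b_k at a time; returns (p, w)."""
    m = len(A)
    w = [0.0] * m
    p = [qi * math.exp(-1.0) for qi in q]  # p_i = q_i exp(w'A^i - 1) with w = 0
    for t in range(sweeps * m):
        k = t % m
        row = A[k]

        def phi(nu):
            # phi(nu) = A_k p^{t+1} - b_k, with p^{t+1}_i = p^t_i exp(nu A_k^i)
            return sum(a * pi * math.exp(nu * a) for a, pi in zip(row, p)) - b[k]

        nu = solve_increasing(phi)
        w[k] += nu
        p = [pi * math.exp(nu * a) for a, pi in zip(row, p)]
    return p, w


A = [[1.0] * 6, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]  # normalization row, state-value row
b = [1.0, 4.0]
q = [1.0 / 6.0] * 6
p, w = bregman(A, b, q)
```

Since $\varphi'(\nu) = \sum_i (A^i_k)^2 \, p^t_i \exp(\nu A^i_k) \geq 0$, the function $\varphi$ is increasing, which is what makes the simple bisection applicable in this sketch.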

From our discussion of Entropy optimization under linear constraints, it should be clear
that the minimum information divergence distribution for a system under constraints on
the expectation of functions taking values on the system's states,

$$ E_{p(x)} \, a_k(x) = \int a_k(x) p(x) \, dx = b_k , $$

(including the normalization constraint, $a_0 = 1$, $b_0 = 1$) has the form

$$ p(x) = q(x) \exp\left( -\theta_0 - \theta_1 a_1(x) - \theta_2 a_2(x) \ldots \right) . $$

Note that we took $\theta_0 = -(w_0 - 1)$, $\theta_k = -w_k$, and we have also indexed the state $i$
by variable $x$, so as to write the last equation in the standard form used in the statistical
literature.

Several distributions commonly used in Statistics can be interpreted as minimum 
information (or MaxEnt) densities (relative to the uniform distribution, if not otherwise 
stated) given some constraints over the expected value of state functions. For example: 

The Normal distribution is characterized as the distribution of maximum entropy on
$R^n$, given the expected values of its first and second moments, i.e., mean vector and
covariance matrix.

The Wishart distribution:

$$ f(S \mid \nu, V) = c(\nu, V) \exp\left( \frac{\nu - d - 1}{2} \log(\det(S)) - \sum_{i,j} V_{i,j} S_{i,j} \right) $$

is characterized as the distribution of maximum entropy in the support $S > 0$, for a
$d \times d$ matrix $S$, given the expected value of the elements and log-determinant of matrix $S$.
That is, writing $\Gamma'$ for the digamma function,

The Dirichlet distribution

$$ f(x \mid \theta) = c(\theta) \exp\left( \sum_{k=1}^m (\theta_k - 1) \log(x_k) \right) $$

is characterized as the distribution of maximum entropy in the support $x \geq 0 \mid 1'x = 1$,
given the expected values of the log-coordinates, $E(\log(x_k))$.

Jeffrey's Rule:

Richard Jeffrey considered the problem of updating an old probability distribution, $q$,
to a new distribution, $p$, given new constraints on the probabilities of a partition, that is,

$$ \sum_{i \in S_k} p_i = \alpha_k , \ \ \sum_k \alpha_k = 1 , \ \ S_1 \cup \ldots \cup S_m = \{1, \ldots n\} , \ \ S_l \cap S_k = \emptyset , \ l \neq k . $$

His solution to this problem, known as Jeffrey's rule, coincides with the minimum
information divergence distribution, relative to $q$, given the new constraints. This solution
can be expressed analytically as

$$ p_i = \alpha_k \, q_i \Big/ \sum_{j \in S_k} q_j , \ \ k \mid i \in S_k . $$
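
The analytical solution is a rescaling of the old probabilities inside each class of the partition, as the following sketch shows. The function name `jeffrey_update`, the 0-based state indices and the sample numbers are illustrative assumptions.

```python
def jeffrey_update(q, classes, alpha):
    """Jeffrey's rule: p_i = alpha_k * q_i / sum_{j in S_k} q_j, for i in S_k.

    q       -- old distribution, as a list
    classes -- list of index lists S_1, ..., S_m forming a partition (0-based)
    alpha   -- new probabilities of the classes, summing to one
    """
    p = [0.0] * len(q)
    for members, a in zip(classes, alpha):
        mass = sum(q[i] for i in members)  # old probability of the class
        for i in members:
            p[i] = a * q[i] / mass
    return p


q = [0.1, 0.2, 0.3, 0.4]
p = jeffrey_update(q, classes=[[0, 1], [2, 3]], alpha=[0.5, 0.5])
```

Inside each class the ratios between probabilities are unchanged; only the total mass of the class is moved to its new value.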



E.5 Fisher's Metric and Jeffreys' Prior 

The Fisher Information Matrix, $J(\theta)$, is defined as minus the expected Hessian of the
log-likelihood. Under appropriate regularity conditions, the information geometry is defined
by the metric in the parameter space given by the Fisher information matrix, that is, the
geometric length of a curve is computed integrating the form $dl^2 = d\theta' J(\theta) d\theta$.

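Lemma: The Fisher information matrix can also be written as the covariance matrix
of the gradient of the same log-likelihood, i.e.,

$$ J(\theta) \equiv - E_X \frac{\partial^2 \log p(x \mid \theta)}{\partial \theta^2} = E_X \left( \frac{\partial \log p(x \mid \theta)}{\partial \theta} \, \frac{\partial \log p(x \mid \theta)}{\partial \theta} \right) . $$

Proof:

$$ \int_X p(x \mid \theta) \, dx = 1 \ \Rightarrow \ \int_X \frac{\partial p(x \mid \theta)}{\partial \theta} \, dx = 0 \ \Rightarrow $$

$$ \int_X \frac{\partial p(x \mid \theta)}{\partial \theta} \frac{p(x \mid \theta)}{p(x \mid \theta)} \, dx = \int_X \frac{\partial \log p(x \mid \theta)}{\partial \theta} \, p(x \mid \theta) \, dx = 0 , $$

differentiating again relative to the parameter,

$$ \int_X \left( \frac{\partial^2 \log p(x \mid \theta)}{\partial \theta^2} \, p(x \mid \theta) + \frac{\partial \log p(x \mid \theta)}{\partial \theta} \frac{\partial p(x \mid \theta)}{\partial \theta} \right) dx = 0 , $$

observing that the second term can be written as

$$ \int_X \frac{\partial \log p(x \mid \theta)}{\partial \theta} \frac{\partial p(x \mid \theta)}{\partial \theta} \frac{p(x \mid \theta)}{p(x \mid \theta)} \, dx = \int_X \frac{\partial \log p(x \mid \theta)}{\partial \theta} \frac{\partial \log p(x \mid \theta)}{\partial \theta} \, p(x \mid \theta) \, dx , $$

we obtain the lemma.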
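
As a small numerical illustration of the lemma, the sketch below evaluates both sides of the identity for a single Bernoulli observation with parameter $\theta$. The Bernoulli model and the function name `bernoulli_information` are assumptions chosen for this example; they are not taken from the text.

```python
def bernoulli_information(theta):
    """Return (-E[d2 log p / d theta2], E[(d log p / d theta)^2]) for Bernoulli(theta).

    log p(x | theta) = x log(theta) + (1 - x) log(1 - theta), x in {0, 1}.
    """
    outcomes = [(1, theta), (0, 1.0 - theta)]  # pairs (x, Pr(x | theta))

    def score(x):
        return x / theta - (1 - x) / (1.0 - theta)

    def hessian(x):
        return -x / theta ** 2 - (1 - x) / (1.0 - theta) ** 2

    minus_expected_hessian = -sum(pr * hessian(x) for x, pr in outcomes)
    expected_squared_score = sum(pr * score(x) ** 2 for x, pr in outcomes)
    return minus_expected_hessian, expected_squared_score


lhs, rhs = bernoulli_information(0.3)
# Both sides agree with the closed form 1 / (theta (1 - theta))
closed_form = 1.0 / (0.3 * 0.7)
```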



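Harold Jeffreys used the Fisher metric to define a class of prior distributions,
proportional to the square root of the determinant of the information matrix,

$$ p(\theta) \propto |J(\theta)|^{1/2} . $$

Lemma: Jeffreys' priors are geometric objects in the sense of being invariant by a
continuous and differentiable change of coordinates in the parameter space, $\eta = f(\theta)$.
The proof follows Zellner (1971, p.41-54):

Proof:

$$ J(\theta) = \left[ \frac{\partial \eta}{\partial \theta} \right]' J(\eta) \left[ \frac{\partial \eta}{\partial \theta} \right] , $$

hence

$$ |J(\theta)|^{1/2} = \left| \frac{\partial \eta}{\partial \theta} \right| \, |J(\eta)|^{1/2} , $$

and

$$ |J(\theta)|^{1/2} \, d\theta = |J(\eta)|^{1/2} \, d\eta . \ \ \mbox{Q.E.D.} $$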
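Example: For the multinomial distribution,

$$ p(x \mid \theta) = n! \prod_{i=1}^m \theta_i^{x_i} \Big/ \prod_{i=1}^m x_i! , \ \ \theta_m = 1 - \sum_{i=1}^{m-1} \theta_i , \ \ x_m = n - \sum_{i=1}^{m-1} x_i , $$

$$ L = \log p(\theta \mid x) = \sum_{i=1}^m x_i \log \theta_i , $$

$$ \frac{\partial^2 L}{(\partial \theta_i)^2} = - \frac{x_i}{\theta_i^2} - \frac{x_m}{\theta_m^2} , \ \ \frac{\partial^2 L}{\partial \theta_i \partial \theta_j} = - \frac{x_m}{\theta_m^2} , \ \ i, j = 1 \ldots m-1 , \ i \neq j , $$

$$ - E_X \frac{\partial^2 L}{(\partial \theta_i)^2} = \frac{n}{\theta_i} + \frac{n}{\theta_m} , \ \ - E_X \frac{\partial^2 L}{\partial \theta_i \partial \theta_j} = \frac{n}{\theta_m} , $$

$$ |J(\theta)| \propto (\theta_1 \theta_2 \ldots \theta_m)^{-1} , \ \ p(\theta) \propto (\theta_1 \theta_2 \ldots \theta_m)^{-1/2} , $$

$$ p(\theta \mid x) \propto \theta_1^{x_1 - 1/2} \, \theta_2^{x_2 - 1/2} \ldots \theta_m^{x_m - 1/2} . $$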
Hence, in the multinomial example, Jeffreys' prior "discounts" half an observation
of each kind, while the maxent prior discounts one full observation, and the flat prior
discounts none. Similarly, slightly different versions of uninformative priors for the
multivariate normal distribution are shown in section C.3. This situation leads to the possible
criticism stated in Berger (1993, p. 89):

"Perhaps the most embarrassing feature of noninformative priors, however, is
simply that there are often so many of them."

One response to this criticism, to which Berger (1993, p. 90) explicitly subscribes, is
that

"it is rare for the choice of a noninformative prior to markedly affect the
answer... so that any reasonable noninformative prior can be used. Indeed, if
the choice of noninformative prior does have a pronounced effect on the answer,
then one is probably in a situation where it is crucial to involve subjective prior
information."

The robustness of the inference procedures to variations on the form of the uninformative
prior can be tested using sensitivity analysis, as discussed in section A.6. For alternative
approaches to robustness and sensitivity analysis, see Berger (1993, sec. 4.7).

In general Jeffreys' priors are not minimally informative in any sense. However, Zellner
(1971, p.41-54, Appendix to chapter 2: Prior Distributions Representing "Knowing
Little") gives the following argument (attributed to Lindley) to present Jeffreys' priors
as asymptotically minimally informative. The information measure of $p(x \mid \theta)$, $I(\theta)$; the
prior average information, $A$; the information gain, $G$, that is, the prior average
information associated with an observation, $A$, minus the prior information measure; and the
asymptotic information gain, $G_a$, are defined as follows:

$$ I(\theta) = \int p(x \mid \theta) \log p(x \mid \theta) \, dx ; \ \ A = \int I(\theta) p(\theta) \, d\theta ; \ \ G = A - \int p(\theta) \log p(\theta) \, d\theta . $$

Although Jeffreys' priors do not in general maximize the information gain, $G$, the asymptotic
convergence results presented in the next section imply that Jeffreys' priors maximize
the asymptotic information gain, $G_a$. For further details and generalizations, see Amari
(2007), Amari et al. (1987), Berger and Bernardo (1992), Berger (1993), Bernardo and
Smith (2000), Hartigan (1983), Jeffreys (1961), Scholl (1998), and Zhu (1998).

E.6 Posterior Asymptotic Convergence

The Information Divergence, $I(p, q)$, can be used to prove several asymptotic results that
are fundamental to Bayesian Statistics. We present in this section two of these basic
results, following Gelman (1995, Ap.B).






Theorem Posterior Consistency for Discrete Parameters:

Consider a model where $f(\theta)$ is the prior in a discrete parameter space, $\Theta = \{\theta^1, \theta^2, \ldots\}$,
$X = [x^1, \ldots x^n]$ is a series of observations, and the posterior is given by

$$ f(\theta^k \mid X) \propto f(\theta^k) \, p(X \mid \theta^k) = f(\theta^k) \prod_{i=1}^n p(x^i \mid \theta^k) . $$

Further, assume that in this model there is a single value for the vector parameter, $\theta^0$,
that gives the best approximation for the "true" predictive distribution $g(x)$, in the sense
that it minimizes the information divergence

$$ \{\theta^0\} = \arg\min_k I\left( g(x), p(x \mid \theta^k) \right) . $$

Then,

$$ \lim_{n \to \infty} f(\theta^k \mid X) = \delta(\theta^k, \theta^0) . $$

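Heuristic Argument: Consider the logarithmic coefficient

$$ \log\left( \frac{f(\theta^k \mid X)}{f(\theta^0 \mid X)} \right) = \log\left( \frac{f(\theta^k)}{f(\theta^0)} \right) + \sum_{i=1}^n \log\left( \frac{p(x^i \mid \theta^k)}{p(x^i \mid \theta^0)} \right) . $$

The first term is a constant, and the second term is a sum whose terms all have negative
expected value (relative to $x$, for $k \neq 0$) since, by our hypotheses, $\theta^0$ is the unique
argument that minimizes $I(g(x), p(x \mid \theta^k))$. Hence, (for $k \neq 0$), the right hand side goes
to minus infinity as $n$ increases. Therefore, at the left hand side, $f(\theta^k \mid X)$ must go to
zero. Since the total probability adds to one, $f(\theta^0 \mid X)$ must go to one, QED.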
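
The concentration of the posterior can be illustrated with a small deterministic computation. In the sketch below the parameter space, the uniform prior and the data (a stream of 0-1 observations with relative frequency 0.6 of ones, so that the "true" distribution is not in the model) are assumptions made for this example, and `posterior` is a hypothetical helper name. Among the three candidate parameters, 0.5 minimizes the information divergence from the data frequency.

```python
import math


def posterior(thetas, prior, ones, zeros):
    """Posterior over a discrete parameter space for i.i.d. 0-1 observations.

    Works with log-probabilities and subtracts the maximum before
    exponentiating, to avoid numerical underflow for large samples.
    """
    logs = [math.log(f) + ones * math.log(th) + zeros * math.log(1.0 - th)
            for th, f in zip(thetas, prior)]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [wgt / total for wgt in weights]


thetas = [0.2, 0.5, 0.8]
prior = [1.0 / 3.0] * 3
post_small = posterior(thetas, prior, ones=6, zeros=4)      # n = 10
post_large = posterior(thetas, prior, ones=600, zeros=400)  # n = 1000
```

With ten observations the posterior is still spread over the candidates; with a thousand, essentially all of its mass sits on the parameter closest, in information divergence, to the data frequency.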

We can extend this result to continuous parameter spaces, assuming several regularity
conditions, like continuity, differentiability, and having the argument $\theta^0$ as an interior
point of $\Theta$ with the appropriate topology. In such a context, we can state that, given a
pre-established small neighborhood around $\theta^0$, like $C(\theta^0, \epsilon)$, the cube of side size $\epsilon$
centered at $\theta^0$, this neighborhood concentrates almost all mass of $f(\theta \mid X)$, as the number of
observations grows to infinity. Under the same regularity conditions, we also have that the
Maximum a Posteriori (MAP) estimator is a consistent estimator, i.e., $\hat\theta \to \theta^0$.

The next results show the convergence in distribution of the posterior to a Normal 
distribution. For that, we need the Fisher information matrix identity from the last 
section. 

Theorem Posterior Normal Approximation:

The posterior distribution converges to a Normal distribution with mean $\theta^0$ and precision
$n J(\theta^0)$.

Proof (heuristic): We only have to write the second order log-posterior Taylor expansion
centered at $\hat\theta$,

$$ \log f(\theta \mid X) = \log f(\hat\theta \mid X) + \frac{\partial \log f(\hat\theta \mid X)}{\partial \theta} (\theta - \hat\theta) + \frac{1}{2} (\theta - \hat\theta)' \frac{\partial^2 \log f(\hat\theta \mid X)}{\partial \theta^2} (\theta - \hat\theta) + O(\theta - \hat\theta)^3 . $$

The term of order zero is a constant. The linear term is null, for $\hat\theta$ is the MAP
estimator at an interior point of $\Theta$. The Hessian in the quadratic term is

$$ H(\hat\theta) = \frac{\partial^2 \log f(\hat\theta \mid X)}{\partial \theta^2} = \frac{\partial^2 \log f(\hat\theta)}{\partial \theta^2} + \sum_{i=1}^n \frac{\partial^2 \log p(x^i \mid \hat\theta)}{\partial \theta^2} . $$

The Hessian is negative definite, by the regularity conditions, and because $\hat\theta$ is the MAP
estimator. The first term is constant, and the second is the sum of $n$ i.i.d. random
variables. On the other hand, we have already shown that the MAP estimator is consistent, and also
that all the posterior mass concentrates around $\theta^0$. We also see that the Hessian grows
(on average) linearly with $n$, and that the higher order terms can not grow super-linearly.
Also, for a given $n$ and $\theta \to \hat\theta$, the quadratic term dominates all higher order terms. Hence,
the quadratic approximation of the log-posterior is increasingly more precise, Q.E.D.

Given the importance of this result, we present an alternative proof, also giving the 
reader an alternative way to visualize the convergence process, see Figure 1. 

Theorem MLE Normal Approximation:

The Maximum Likelihood Estimator (MLE) is asymptotically Normal, with mean $\theta^0$ and
precision $n J(\theta^0)$.

Proof (schematic): Assuming all needed regularity conditions, from the first order
optimality conditions,

$$ \frac{1}{n} \sum_{i=1}^n \frac{\partial \log p(x^i \mid \hat\theta)}{\partial \theta} = 0 , $$

hence, by the mean value theorem, there is an intermediate point $\tilde\theta$ such that

$$ \frac{1}{n} \sum_{i=1}^n \frac{\partial \log p(x^i \mid \theta^0)}{\partial \theta} + \frac{1}{n} \sum_{i=1}^n \frac{\partial^2 \log p(x^i \mid \tilde\theta)}{\partial \theta^2} (\hat\theta - \theta^0) = 0 , $$

or, equivalently,

$$ \sqrt{n} \, (\hat\theta - \theta^0) = - \left( \frac{1}{n} \sum_{i=1}^n \frac{\partial^2 \log p(x^i \mid \tilde\theta)}{\partial \theta^2} \right)^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{\partial \log p(x^i \mid \theta^0)}{\partial \theta} . $$

We assume the regularity conditions are enough to assure that

$$ - \left( \frac{1}{n} \sum_{i=1}^n \frac{\partial^2 \log p(x^i \mid \tilde\theta)}{\partial \theta^2} \right)^{-1} \to J(\theta^0)^{-1} , $$

for the MLE is consistent, $\hat\theta \to \theta^0$, and hence so is the mean value point, $\tilde\theta \to \theta^0$; and

$$ \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{\partial \log p(x^i \mid \theta^0)}{\partial \theta} \to N\left( 0, J(\theta^0) \right) , $$

because we have the sum of $n$ i.i.d. vectors with mean $0$ and, by the Information Matrix
Identity lemma, covariance $J(\theta^0)$.

Hence, we finally have

$$ \sqrt{n} \, (\hat\theta - \theta^0) \to N\left( 0, J(\theta^0)^{-1} J(\theta^0) J(\theta^0)^{-1} \right) = N\left( 0, J(\theta^0)^{-1} \right) . $$

Q.E.D.

Exercises:

1) Implement Bregman's algorithm. It may be more convenient to number the rows
of $A$ from 1 to $m$, and take $k = (t \bmod m) + 1$.

2) I was given a die, that I assumed to be honest. A friend of mine borrowed the die and
reported playing it 60 times, obtaining 4 i's, 8 ii's, 11 iii's, 14 iv's, 13 v's and 10 vi's.

A) What is my Bayesian posterior?

Bi) What was the mean face value? (3.9).

Bii) What is the expected posterior value of this statistic?

C) I called the die manufacturer, and he told me that this die is made so that the
expected value of this statistic is exactly 4.0. Use Bregman's algorithm to obtain the
"entropic posterior", that is, the distribution closest to the prior that obeys the given
constraints. Use as prior: i) the uniform; ii) the Bayesian posterior.

3) Discuss the difference between the Bayesian update and the entropic update. What
is the information given in each case? Observations or constraints?

4) Discuss the possibility of using the FBST to make hierarchical tests for complex
hypotheses using these ideas.

5) Try to give MaxEnt characterizations and Jeffreys' priors for all distributions you
know.



Appendix F 



Matrix Factorizations 



F.1 Matrix Notation

Let us first define some matrix notation. The operator $f : s : t$, to be read from $f$ to $t$ with
step $s$, indicates the vector $[f, f+s, f+2s, \ldots t]$ or the corresponding index domain. $f : t$
is a short hand for $f : 1 : t$. The element in the $i$-th row and $j$-th column of matrix $A$ is
written as $A(i, j)$ or, with subscript row index and superscript column index, as $A_i^j$. Index
vectors can be used to build a matrix by extracting from a larger matrix a given sub-set
of rows and columns. For example, $A(1 : m/2, n/2 : n)$ or $A_{1:m/2}^{n/2:n}$ is the northeast block,
i.e. the block with the first rows and last columns, from $A$. The next example shows a
more general case of this notation,

$$ A = \begin{bmatrix} 11 & 12 & 13 \\ 21 & 22 & 23 \\ 31 & 32 & 33 \end{bmatrix} , \ \ r = [\, 1 \ \ 3 \,] , \ \ s = [\, 3 \ \ 1 \ \ 2 \,] , \ \ A_r^s = A(r, s) = \begin{bmatrix} 13 & 11 & 12 \\ 33 & 31 & 32 \end{bmatrix} . $$

The suppression of an index vector indicates that the corresponding index spans all values
in its current context. Hence, $A(i, :)$ or $A_i$ indicates the $i$-th row, and $A(:, j)$ or $A^j$
indicates the $j$-th column of matrix $A$.
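
The index notation above can be mimicked with nested lists, as in the following sketch. The helper names `index_range` and `submatrix` are illustrative assumptions; indices are kept 1-based, as in the text, and converted internally to Python's 0-based positions.

```python
def index_range(f, t, s=1):
    """The index vector f:s:t, that is, [f, f+s, f+2s, ..., t]."""
    return list(range(f, t + 1, s))


def submatrix(A, rows, cols):
    """A(r, s): extract the given rows and columns, with 1-based index vectors."""
    return [[A[i - 1][j - 1] for j in cols] for i in rows]


A = [[11, 12, 13],
     [21, 22, 23],
     [31, 32, 33]]

block = submatrix(A, [1, 3], [3, 1, 2])                  # the example A(r, s)
second_row = submatrix(A, [2], index_range(1, 3))        # A(2, :)
third_column = submatrix(A, index_range(1, 3), [3])      # A(:, 3)
```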

A single or multiple list of matrices is referenced by one or more indices in braces, like
$A\{k\}$ or $A\{p, q\}$. As for element indices, for double lists we may also use the subscript
- superscript alternative notation for $A\{p, q\}$, namely, $A\{^q_p\}$. This compact notation is
specially useful for building block matrices, like in the following example,

$$ A = \begin{bmatrix} A\{^1_1\} & A\{^2_1\} & \ldots & A\{^s_1\} \\ A\{^1_2\} & A\{^2_2\} & \ldots & A\{^s_2\} \\ \vdots & \vdots & \ddots & \vdots \\ A\{^1_r\} & A\{^2_r\} & \ldots & A\{^s_r\} \end{bmatrix} . $$

Hence, $A\{p, q\}(i, j)$ or $A\{^q_p\}^j_i$ indicates the element in the $i$-th row and $j$-th column of the
block situated at the $p$-th block of rows and $q$-th block of columns of matrix $A$, $A\{p, q\}(:, j)$
or $A\{^q_p\}^j$ indicates the $j$-th column of the same block, and so on.

An upper case letter usually stands for (or starts) a matrix name, while lower case
letters are used for vectors or scalars. Whenever recommended by style or tradition, we
may slightly abuse the notation using upper case for the name of a matrix and lower case
for some of its parts. For example, we may write $x^j$, instead of $X^j$, for the $j$-th column of
matrix $X$.

The vectors of zeros and ones, with appropriate dimension given by the context, are
$0$ and $1$. The transpose of matrix $M$ is $M'$, and the transpose inverse, $M^{-t}$. In $(M + v)$,
where $v$ is a column (row) vector of compatible dimension, $v$ is added to each column
(row) of matrix $M$.

A tilde accent, $\tilde{A}$, indicates some simple transformation of matrix $A$. For example,
it may indicate a row and / or column permutation, see next subsection. A tilde accent
may also indicate a normalization, like $\tilde{x} = (1 / \|x\|) x$.

The $p$-norm of a vector $x$ is given by $\|x\|_p = \left( \sum |x_i|^p \right)^{1/p}$. Hence, for a non-negative
vector $x$, we can write its 1-norm as $\|x\|_1 = 1'x$. $V > 0$ is a positive definite matrix.
The Hadamard or pointwise product, $\odot$, is defined by $M = A \odot B \Leftrightarrow M_i^j = A_i^j B_i^j$. The
squared Frobenius norm of a matrix is defined by $\mathrm{frob2}(M) = \sum_{i,j} (M_i^j)^2$.

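The Diagonal operator, diag, if applied to a square matrix, extracts the main diagonal
as a vector, and if applied to a vector, produces the corresponding diagonal matrix.

$$ \mathrm{diag}(A) = \begin{bmatrix} A_1^1 \\ A_2^2 \\ \vdots \\ A_n^n \end{bmatrix} , \ \ \mathrm{diag}(a) = \begin{bmatrix} a_1 & 0 & \ldots & 0 \\ 0 & a_2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & a_n \end{bmatrix} , \ \ \mathrm{diag}^2(A) = \begin{bmatrix} A_1^1 & 0 & \ldots & 0 \\ 0 & A_2^2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & A_n^n \end{bmatrix} . $$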

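The Kronecker product of two matrices is a block matrix where block $\{i, j\}$ is the
second matrix multiplied by element $(i, j)$ of the first matrix:

$$ A \otimes B = \begin{bmatrix} A_1^1 B & A_1^2 B & \ldots & A_1^n B \\ \vdots & \vdots & \ddots & \vdots \\ A_m^1 B & A_m^2 B & \ldots & A_m^n B \end{bmatrix} . $$

The following properties are easy to check:

• $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$

• $(A \otimes B)' = A' \otimes B'$

• $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$

The Vec operator stacks the columns of a matrix into a single column vector, that is,
if $A$ is $m \times n$,

$$ \mathrm{Vec}(A) = \begin{bmatrix} A^1 \\ A^2 \\ \vdots \\ A^n \end{bmatrix} . $$

The following properties are easy to check:

• $\mathrm{Vec}(A + B) = \mathrm{Vec}(A) + \mathrm{Vec}(B)$

• $\mathrm{Vec}(AB) = \begin{bmatrix} A B^1 \\ A B^2 \\ \vdots \\ A B^n \end{bmatrix} = (I \otimes A) \, \mathrm{Vec}(B)$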
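
The properties of the Kronecker product and of the Vec operator can be checked on small integer matrices. The sketch below uses plain nested lists; the helper names `matmul`, `transpose`, `kron`, `vec` and `identity`, and the sample matrices, are illustrative assumptions.

```python
def matmul(A, B):
    """Ordinary matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def kron(A, B):
    """Kronecker product: block (i, j) is B multiplied by element (i, j) of A."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def vec(A):
    """Stack the columns of A into a single column vector."""
    return [[x] for col in zip(*A) for x in col]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


A = [[1, 2], [3, 4]]
B = [[0, 1, 2], [3, 4, 5]]
C = [[2, 0], [1, 1]]
D = [[1, 1], [0, 2]]
E = [[0, 1], [1, 0]]

# Vec(AB) = (I (x) A) Vec(B), with I of order equal to the number of columns of B
vec_identity_holds = vec(matmul(A, B)) == matmul(kron(identity(3), A), vec(B))
# (A (x) E)(C (x) D) = (AC) (x) (ED)
mixed_product_holds = matmul(kron(A, E), kron(C, D)) == kron(matmul(A, C), matmul(E, D))
# (A (x) B)' = A' (x) B'
transpose_holds = transpose(kron(A, B)) == kron(transpose(A), transpose(B))
```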



Permutations and Partitions 



We now introduce some concepts and notations related to the permutation and partition 
of an $m \times n$ matrix $A$. A permutation matrix is a matrix obtained by permuting rows and 
columns of the identity matrix, $I$. Performing on $I$ a given row (column) permutation 
yields the corresponding row (column) permutation matrix. 

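Given row and column permutation matrices, $P$ and $Q$, the corresponding vectors of 
permuted row and column indices are 

$$
p = \left( P \begin{bmatrix} 1 \\ 2 \\ \vdots \\ m \end{bmatrix} \right)' , \quad
q = \begin{bmatrix} 1 & 2 & \cdots & n \end{bmatrix} Q .
$$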

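Performing a row (column) permutation on a matrix $A$, obtaining the permuted matrix 
$\tilde{A}$, is equivalent to multiplying it on the left (right) by the corresponding row (column) 
permutation matrix. Moreover, if $p$ ($q$) is the corresponding vector of permuted row 
(column) indices, 

$$
A_p = PA = I_p A , \quad A^q = AQ = A I^q .
$$

Example: Given the matrices 

$$
A = \begin{bmatrix} 11 & 12 & 13 \\ 21 & 22 & 23 \\ 31 & 32 & 33 \end{bmatrix} , \quad
P = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} , \quad
Q = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix} ,
$$

we have 

$$
p = q = \begin{bmatrix} 3 & 1 & 2 \end{bmatrix} , \quad
PA = \begin{bmatrix} 31 & 32 & 33 \\ 11 & 12 & 13 \\ 21 & 22 & 23 \end{bmatrix} , \quad
AQ = \begin{bmatrix} 13 & 11 & 12 \\ 23 & 21 & 22 \\ 33 & 31 & 32 \end{bmatrix} .
$$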
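
The equivalence between multiplying by a permutation matrix and indexing with a permuted index vector can be sketched in plain Python using the data of the example. The helper names (`identity`, `take_rows`, `take_cols`, `matmul`) are illustrative assumptions; index vectors are 1-based, as in the text.

```python
def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def take_rows(A, p):
    """A_p: rows of A in the order given by the 1-based index vector p."""
    return [list(A[i - 1]) for i in p]

def take_cols(A, q):
    """A^q: columns of A in the order given by the 1-based index vector q."""
    return [[row[j - 1] for j in q] for row in A]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

A = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
p = q = [3, 1, 2]

P = take_rows(identity(3), p)   # row permutation matrix, I_p
Q = take_cols(identity(3), q)   # column permutation matrix, I^q

PA = matmul(P, A)               # same as take_rows(A, p)
AQ = matmul(A, Q)               # same as take_cols(A, q)
print(PA)
print(AQ)
```

In practice one would index with `p` and `q` directly and never build `P` or `Q`; the matrices are formed here only to confirm the identities.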



A square matrix, $A$, is symmetric iff it is equal to its transpose, that is, iff $A = A'$. 
A symmetric permutation of a square matrix $A$ is a permutation of form $\tilde{A} = PAP'$ or 
$\tilde{A} = Q'AQ$, where $P$ or $Q$ are (row or column) permutation matrices. A square matrix, 
$A$, is orthogonal iff its inverse equals its transpose, that is, iff $A^{-1} = A'$. The following 
statements are easy to check: 

(a) A permutation matrix is orthogonal. 

(b) A symmetric permutation of a symmetric matrix is still symmetric. 

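A permutation vector, $p$, and a termination vector, $t$, define a partition of $m$ original 
indices in $s$ classes: 

$$
\begin{bmatrix} p(1) \\ \vdots \\ p(t(1)) \end{bmatrix} , \;
\begin{bmatrix} p(t(1)+1) \\ \vdots \\ p(t(2)) \end{bmatrix} , \; \ldots \;
\begin{bmatrix} p(t(s-1)+1) \\ \vdots \\ p(t(s)) \end{bmatrix} , \quad
\text{where} \quad t(0) = 0 < t(1) < \ldots < t(s-1) < t(s) = m .
$$

We define the corresponding permutation and partition matrices, $P$ and $T$, as 

$$
P = I_{p(1:m)} = \begin{bmatrix} P\{1\} \\ P\{2\} \\ \vdots \\ P\{s\} \end{bmatrix} , \quad
P\{r\} = I_{p(t(r-1)+1 \,:\, t(r))} , \quad
T_r = \mathbf{1}'(P\{r\}) \quad \text{and} \quad
T = \begin{bmatrix} T_1 \\ \vdots \\ T_s \end{bmatrix} .
$$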



These matrices facilitate writing functions of a given partition, like 

• The indices in class $r$ 

$$
P\{r\} \begin{bmatrix} 1 \\ \vdots \\ m \end{bmatrix} = \begin{bmatrix} p(t(r-1)+1) \\ \vdots \\ p(t(r)) \end{bmatrix} ;
$$

• The number of indices in class $r$ 

$$
T_r \mathbf{1} = t(r) - t(r-1) ;
$$

• A sub-matrix with the row indices in class $r$ 

$$
P\{r\} A = \begin{bmatrix} A_{p(t(r-1)+1)} \\ \vdots \\ A_{p(t(r))} \end{bmatrix} ;
$$

• The summation of the rows of a submatrix with row indices in class $r$ 

$$
T_r A = \mathbf{1}'(P\{r\} A) ;
$$

• The rows of a matrix, added over each class 

$$
TA = \begin{bmatrix} T_1 A \\ \vdots \\ T_s A \end{bmatrix} .
$$

Note that a matrix $T$ represents a partition of $m$ indices into $s$ classes if $T$ has dimension 
$s \times m$, $T_h^j \in \{0, 1\}$ and $T$ has orthogonal rows. The element $T_h^j$ indicates if the index 
$j \in 1:m$ is in class $h \in 1:s$. 
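
As an illustration of these definitions, the sketch below builds the partition matrix $T$ from a permutation vector and a termination vector and uses it to count and add rows by class. The function name `partition_matrix`, the convention of passing `t` with its leading `t(0) = 0`, and the example data are assumptions made for this sketch.

```python
def partition_matrix(p, t, m):
    """Return the s x m matrix T, where T[r][j-1] = 1 iff index j is in class r+1.

    p is a 1-based permutation vector; t = [t(0)=0, t(1), ..., t(s)=m].
    """
    T = []
    for r in range(1, len(t)):
        row = [0] * m
        # Positions t(r-1)+1 .. t(r) of p (1-based) are the slice t[r-1]:t[r].
        for k in range(t[r - 1], t[r]):
            row[p[k] - 1] = 1
        T.append(row)
    return T

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical example: m = 5 indices split into the classes {3, 1} and {2, 5, 4}.
p = [3, 1, 2, 5, 4]
t = [0, 2, 5]
T = partition_matrix(p, t, 5)

class_sizes = [sum(row) for row in T]           # T_r 1 = t(r) - t(r-1)
A = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
TA = matmul(T, A)                               # rows of A added over each class
print(T, class_sizes, TA)
```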



F.2 Dense LU, QR and SVD Factorizations 

Vector Spaces and Projectors 

Given two vectors, $x, y \in \mathbb{R}^n$, their scalar product is defined as 

$$
x'y = \sum_{i=1}^{n} x_i y_i .
$$

With this definition in mind, it is easy to check that the scalar product satisfies the 
following properties of the inner product operator: 

1. $\langle x \mid y \rangle = \langle y \mid x \rangle$, symmetry. 

2. $\langle \alpha x + \beta y \mid z \rangle = \alpha \langle x \mid z \rangle + \beta \langle y \mid z \rangle$, linearity. 

3. $\langle x \mid x \rangle \geq 0$, semi-positivity. 

4. $\langle x \mid x \rangle = 0 \Leftrightarrow x = 0$, positivity. 

A given inner product defines the following norm, 

$$
\|x\| = \langle x \mid x \rangle^{1/2} ;
$$

that can in turn be used to define the angle between two vectors: 

$$
\Theta(x, y) = \arccos\left( \frac{\langle x \mid y \rangle}{\|x\| \, \|y\|} \right) .
$$

Let us consider the linear subspace generated by the columns of a matrix $A$, $m$ by $n$, 
$m \geq n$: 

$$
C(A) = \{ y = Ax , \; x \in \mathbb{R}^n \} .
$$

$C(A)$ is called the image of $A$, and the orthogonal complement of $C(A)$, $N(A)$, is called 
the null space of $A$ (more precisely, of its transpose, $A'$), 

$$
N(A) = \{ y \mid A'y = 0 \} .
$$

The projection of a vector $b \in \mathbb{R}^m$ in the column space of $A$ is defined by the relations: 

$$
y = P_{C(A)} b \Leftrightarrow y \in C(A) \wedge (b - y) \perp C(A)
$$

or, equivalently, 

$$
y = P_{C(A)} b \Leftrightarrow y = Ax \wedge A'(b - y) = 0 .
$$

In the sequel we assume that $A$ has full rank, i.e., that its columns are linearly 
independent. It is easy to check that the projection of $b$ in $C(A)$ is given by the linear 
operator 

$$
P_A = A (A'A)^{-1} A' .
$$

If $y = A((A'A)^{-1}A'b)$, then it is obvious that $y \in C(A)$. On the other hand, 

$$
A'(b - y) = A'(I - A(A'A)^{-1}A')b = (A' - IA')b = 0 .
$$
Orthogonal Matrices 

A real square matrix $Q$ is said to be orthogonal iff its inverse is equal to its transpose, 
that is, $Q'Q = I$. The columns of an orthogonal matrix $Q$ are an orthonormal basis for 
$\mathbb{R}^n$. The quadratic norm of a vector $v$, given by 

$$
\|v\|^2 = \sum_i (v_i)^2 = v'v ,
$$

is not changed by an orthogonal transform, since 

$$
(Qv)'(Qv) = v'Q'Qv = v'Iv = v'v .
$$



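Given a vector in $\mathbb{R}^2$, $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$, a rotation of this vector by an angle $\theta$ is given by the 
linear transform 

$$
G\{\theta\} x = \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ -\sin(\theta) & \cos(\theta) \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} .
$$

A rotation is an orthogonal transform, since 

$$
G\{\theta\}' G\{\theta\} = \begin{bmatrix} \cos(\theta)^2 + \sin(\theta)^2 & 0 \\ 0 & \cos(\theta)^2 + \sin(\theta)^2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} .
$$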



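The Givens rotation is a linear operator whose matrix is the identity, except for the 
insertion of a bidimensional rotation matrix in rows and columns $i$ and $j$: 

$$
G\{i, j, \theta\} = \begin{bmatrix}
1 & & & & & & \\
& \ddots & & & & & \\
& & \cos(\theta) & \cdots & \sin(\theta) & & \\
& & \vdots & \ddots & \vdots & & \\
& & -\sin(\theta) & \cdots & \cos(\theta) & & \\
& & & & & \ddots & \\
& & & & & & 1
\end{bmatrix} .
$$

The left multiplication of matrix $A$ by a Givens transform, $G'A$, rotates rows $i$ and $j$ 
of $A$ counterclockwise by an angle $\theta$. Since the product of orthogonal transforms is still 
orthogonal, we can use a sequence of Givens rotations to build more complex orthogonal 
transforms. 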

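We now define some simple bidimensional rotations that will be used as building blocks 
in the construction of several algorithms. Let us consider, in $\mathbb{R}^2$, a vector $v$, a symmetric 
matrix $S$, and an asymmetric matrix $A$, 

$$
v = \begin{bmatrix} x \\ y \end{bmatrix} , \quad
S = \begin{bmatrix} p & q \\ q & r \end{bmatrix} , \quad
A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} .
$$

In order to set to zero the second component of vector $v$ by means of a left rotation, 
$G\{\theta_v\}' v$, it is possible to use the angle 

$$
\theta_v = \arctan\left( \frac{-y}{x} \right) .
$$

In order to diagonalize the symmetric matrix by a symmetric rotation, $G\{\theta_{diag}\}' S \, G\{\theta_{diag}\}$, 
it is possible to use the angle 

$$
\theta_{diag} = \frac{1}{2} \arctan\left( \frac{2q}{r - p} \right) .
$$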

In order to symmetrize the asymmetric matrix by means of a left rotation, $G\{\theta_{sym}\}' A$, 
it is possible to use the angle 

$$
\theta_{sym} = \arctan\left( \frac{b - c}{a + d} \right) .
$$

Hence, it is possible to diagonalize the asymmetric matrix by means of a symmetrization 
followed by a diagonalization operation. Alternatively, it is possible to use the left 
and right Jacobi rotations, $J\{\theta_r\}' A \, J\{\theta_l\}$, defined as follows 

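$$
\theta_{sum} = \theta_r + \theta_l = \arctan\left( \frac{c + b}{d - a} \right) , \quad
\theta_{dif} = \theta_l - \theta_r = \arctan\left( \frac{c - b}{d + a} \right) \quad \text{or}
$$

$$
J\{\theta_r\}' = G\{\theta_{sum}/2\}' \, G\{-\theta_{dif}/2\}' , \quad
J\{\theta_l\} = G\{\theta_{dif}/2\} \, G\{\theta_{sum}/2\} .
$$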

When computing the rotation matrices, there is no need to make explicit use of the 
rotation angles, nor is it necessary to use trigonometric functions, but only to compute 
the factors $c = \cos(\theta)$ and $s = \sin(\theta)$, directly as 

$$
c = \frac{x}{\sqrt{x^2 + y^2}} , \quad s = \frac{-y}{\sqrt{x^2 + y^2}} .
$$

In order to avoid numerical overflow, one can use the procedure 

• If $y = 0$, then $c = 1$, $s = 0$. 

• If $|y| \geq |x|$, then $t = -x/y$, $s = 1/\sqrt{1 + t^2}$, $c = st$. 

• If $|y| < |x|$, then $t = -y/x$, $c = 1/\sqrt{1 + t^2}$, $s = ct$. 
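
The overflow-safe procedure can be sketched as follows. The function names `givens` and `rotate` are illustrative assumptions; `rotate` applies $G' = \begin{bmatrix} c & -s \\ s & c \end{bmatrix}$, the left rotation used above. Note that this procedure may return $(c, s)$ with both signs flipped relative to the direct formulas, which still zeroes the second component.

```python
import math

def givens(x, y):
    """Return (c, s) such that G' v has zero second component, where
    G = [[c, s], [-s, c]] and v = (x, y). No squares of x or y are formed."""
    if y == 0.0:
        return 1.0, 0.0
    if abs(y) >= abs(x):
        t = -x / y
        s = 1.0 / math.sqrt(1.0 + t * t)
        return s * t, s
    t = -y / x
    c = 1.0 / math.sqrt(1.0 + t * t)
    return c, c * t

def rotate(c, s, x, y):
    """Apply G' = [[c, -s], [s, c]] to the vector (x, y)."""
    return c * x - s * y, s * x + c * y

c, s = givens(3.0, 4.0)
print(rotate(c, s, 3.0, 4.0))   # second component is zero up to rounding
```

Because only the ratio $t$ is squared, and $|t| \leq 1$, the computation stays finite even for inputs such as `givens(1e200, 1e200)`, where `x * x` would overflow.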



QR Factorization 



Given a full rank real matrix $A$, $m \times n$, $m \geq n$, it is always possible to find an orthogonal 
matrix $Q$ such that $A = Q \begin{bmatrix} R \\ 0 \end{bmatrix}$, where $R$ is a square upper triangular matrix. This is 
the QR factorization (or decomposition) of matrix $A$. The orthogonal factor, $Q = [C \mid N]$, 
gives an orthonormal basis for $\mathbb{R}^m$, where the first $n$ columns give an orthonormal basis 
for $C(A)$, and the last $m - n$ columns give an orthonormal basis for $N(A)$, as can be 
easily checked by the identity $Q'A = \begin{bmatrix} R \\ 0 \end{bmatrix}$. In the sequel a QR factorization algorithm 
is presented. 

The following example illustrates a rotation sequence that takes a $5 \times 3$ matrix to upper 
triangular form. Every index pair, $(j, i)$, indicates a rotation used to zero the position at 
row $i$, column $j$. We assume that the original matrix is dense, that is, that the matrix 
has no zero elements, and illustrate the sparsity pattern in the matrix as the algorithm 
progresses. Each star in the sequence marks a point where the pattern is shown. 

(1, 5) * (1, 4)(1, 3)(1, 2) * (2, 5)(2, 4)(2, 3) * (3, 5)(3, 4) * 

    x x x      x x x      x x x      x x x
    x x x      0 x x      0 x x      0 x x
    x x x      0 x x      0 0 x      0 0 x
    x x x      0 x x      0 0 x      0 0 0
    0 x x      0 x x      0 0 x      0 0 0

Least Squares 

Given an over-determined system, $Ax = b$, where $A$ is $m \times n$, $m > n$, vector $x^*$ is a least 
squares solution to the system iff $x^*$ minimizes the quadratic norm of the residual, that 
is, 

$$
x^* = \mathrm{Arg} \min_x \|Ax - b\| .
$$

Since an orthogonal rotation does not change the square norm of a vector, one can seek 
the least squares solution to this system by minimizing the residual of the system 
transformed by the orthogonal factor of the QR factorization of $A$, 

$$
\|Q'(Ax - b)\|^2 = \left\| \begin{bmatrix} R \\ 0 \end{bmatrix} x - \begin{bmatrix} c \\ d \end{bmatrix} \right\|^2 = \|Rx - c\|^2 + \|0x - d\|^2 , \quad
\text{where} \quad \begin{bmatrix} c \\ d \end{bmatrix} = Q'b .
$$

From the last expression one can see that the solution and the residual of the original 
problem are given by 

$$
x^* = R^{-1}c , \quad y = Ax^* \quad \text{and} \quad z = Q \begin{bmatrix} 0 \\ d \end{bmatrix} .
$$

Since the last $m - n$ columns of $Q$ are an orthonormal basis of $N(A)$, we see that $z \perp C(A)$, 
and can therefore conclude that $y = P_A b$. 
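
The following sketch puts the pieces together: Givens rotations triangularize $[A \mid b]$ column by column, from the bottom row up, and back substitution on $Rx = c$ gives the least squares solution. It is a minimal illustration in plain Python, assuming a full rank matrix; the names `givens` and `qr_least_squares` and the example data are assumptions made here.

```python
import math

def givens(x, y):
    """(c, s) such that [[c, -s], [s, c]] applied to (x, y) zeroes y."""
    if y == 0.0:
        return 1.0, 0.0
    if abs(y) >= abs(x):
        t = -x / y
        s = 1.0 / math.sqrt(1.0 + t * t)
        return s * t, s
    t = -y / x
    c = 1.0 / math.sqrt(1.0 + t * t)
    return c, c * t

def qr_least_squares(A, b):
    """Return (x, residual_norm) minimizing ||Ax - b||, A assumed full rank."""
    m, n = len(A), len(A[0])
    # Work on a copy of [A | b], so that Q' is applied to b at the same time.
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for j in range(n):                      # column being zeroed
        for i in range(m - 1, j, -1):       # rows m-1, ..., j+1, bottom up
            c, s = givens(M[j][j], M[i][j])
            for k in range(j, n + 1):       # rotate rows j and i
                u, w = M[j][k], M[i][k]
                M[j][k] = c * u - s * w
                M[i][k] = s * u + c * w
    # Back substitution on R x = c (the first n rows).
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = acc / M[i][i]
    # The last m - n entries of the transformed right hand side form d.
    residual_norm = math.sqrt(sum(M[i][n] ** 2 for i in range(n, m)))
    return x, residual_norm

# Hypothetical data: fit a line through (0, 0), (1, 1), (2, 1).
x_star, res = qr_least_squares([[1, 0], [1, 1], [1, 2]], [0, 1, 1])
print(x_star, res)
```

The orthogonal factor $Q$ is never formed explicitly; only its action on $b$ is needed for the solution and for the norm of the residual.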




LU and Cholesky Factorizations 

Given a matrix $A$, the elementary operation given by the multiplier $m_i^j$ is the operation 
of subtracting from row $i$ the row $j$ multiplied by $m_i^j$. The elementary operation applied 
to the identity matrix generates the corresponding elementary matrix, 

$$
M\{i, j\} = \begin{bmatrix}
1 & & & & \\
& \ddots & & & \\
& & 1 & & \\
& -m_i^j & & \ddots & \\
& & & & 1
\end{bmatrix} ,
$$

where $-m_i^j$ is at row $i$, column $j$. Applying an elementary operation to matrix $A$ is 
equivalent to multiplying $A$ from the left by the corresponding elementary matrix. 

In the Gaussian elimination algorithm we use a sequence of elementary operations to 
bring $A$ to upper triangular form, 

$$
MA = M\{n, n-1\} \, M\{n-1, n-2\} M\{n, n-2\} \cdots M\{3, 2\} \cdots M\{n-1, 2\} M\{n, 2\} \, M\{2, 1\} \cdots M\{n-1, 1\} M\{n, 1\} \, A = U .
$$

Multiplier $m_i^j$ is computed as the current matrix element at position $(i, j)$ divided by the 
pivot element at the diagonal position $(j, j)$. Elementary operation $M\{i, j\}$ is used to 
eliminate (zero) the position $(i, j)$. The elementary operations are performed in an order 
that prevents the zeros created at previous steps from being filled again. 

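The next example shows the steps of Gaussian elimination on a small matrix. The 
multipliers, in italic, are stored at the positions corresponding to the zeros they created. 

$$
\begin{bmatrix} 2 & 1 & 3 \\ 2 & 3 & 6 \\ 4 & 4 & 6 \end{bmatrix}
\rightarrow
\begin{bmatrix} 2 & 1 & 3 \\ \mathit{1} & 2 & 3 \\ \mathit{2} & 2 & 0 \end{bmatrix}
\rightarrow
\begin{bmatrix} 2 & 1 & 3 \\ \mathit{1} & 2 & 3 \\ \mathit{2} & \mathit{1} & -3 \end{bmatrix}
$$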



The inverse of the product of this sequence of elementary matrices has the lower 
triangular form, that is, 

$$
M^{-1} = M^{-1}\{n, 1\} M^{-1}\{n-1, 1\} \cdots M^{-1}\{2, 1\} \, M^{-1}\{n, 2\} M^{-1}\{n-1, 2\} \cdots M^{-1}\{3, 2\} \cdots M^{-1}\{n, n-2\} M^{-1}\{n-1, n-2\} \, M^{-1}\{n, n-1\} ,
$$

$$
L = M^{-1} = \begin{bmatrix}
1 & & & & \\
m_2^1 & 1 & & & \\
\vdots & \vdots & \ddots & & \\
m_{n-1}^1 & m_{n-1}^2 & \cdots & 1 & \\
m_n^1 & m_n^2 & \cdots & m_n^{n-1} & 1
\end{bmatrix} .
$$

Therefore the algorithm finds the LU factorization, $A = LU$. The lower and upper 
triangular form of $L$ and $U$ allow us to easily compute $L^{-1}z$ and $U^{-1}z$ by simple forward 
and backward substitution. Hence, $A^{-1}z = U^{-1}(L^{-1}z)$ can be computed in just two 
substitution steps. 
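
A minimal sketch of this scheme, without pivoting, is given below. It stores each multiplier in the position of the zero it creates, as in the small example above, and then solves $Ax = z$ by forward and backward substitution. The function names `lu_compact` and `lu_solve` are illustrative assumptions. Within a column the rows are processed top down; the order inside a column does not affect the result.

```python
def lu_compact(A):
    """Gaussian elimination without pivoting. Returns one matrix holding U on and
    above the diagonal and the multipliers of L (unit diagonal implied) below it."""
    n = len(A)
    M = [[float(v) for v in row] for row in A]
    for j in range(n - 1):
        for i in range(j + 1, n):
            m = M[i][j] / M[j][j]        # multiplier m_i^j
            for k in range(j + 1, n):
                M[i][k] -= m * M[j][k]   # row i minus m times row j
            M[i][j] = m                  # store m where the zero was created
    return M

def lu_solve(M, z):
    """Solve A x = z from the compact factorization: x = U^{-1}(L^{-1} z)."""
    n = len(M)
    y = [0.0] * n
    for i in range(n):                   # forward substitution, unit diagonal L
        y[i] = z[i] - sum(M[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):       # backward substitution with U
        x[i] = (y[i] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

M = lu_compact([[2, 1, 3], [2, 3, 6], [4, 4, 6]])
print(M)
print(lu_solve(M, [6, 11, 14]))
```

A production code would add pivoting for numerical stability; it is omitted here to keep the correspondence with the text.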

In case we factor a symmetric matrix $V = LU$, we can collect the diagonal elements 
of $U$ in a diagonal matrix $D$, and write $V = LDL'$. If $V$ is positive definite we can take 
the square roots of the diagonal elements and write $D = D^{1/2}D^{1/2}$. 

Defining $C = LD^{1/2}$, we have $V = CC'$, where $C$ is the Cholesky factor of $V$. For reasons 
of numerical stability, it is recommended to take the square root of each diagonal element 
just before we use it as a pivot element, and then eliminate the elements of its column, 
see Pissanetzky (1984). 
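
The recommendation above, taking the square root of a diagonal element just before using it as a pivot, corresponds to the column-oriented sketch below. The function name `cholesky` and the test matrix are assumptions made for illustration; the input is assumed symmetric positive definite.

```python
import math

def cholesky(V):
    """Return the lower triangular C with V = C C'. V is assumed to be
    symmetric positive definite; only its lower triangle is used."""
    n = len(V)
    W = [[float(v) for v in row] for row in V]   # working copy
    C = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = math.sqrt(W[j][j])               # square root taken just before use
        C[j][j] = pivot
        for i in range(j + 1, n):
            C[i][j] = W[i][j] / pivot            # scale column j by the pivot
        for i in range(j + 1, n):                # update the remaining lower triangle
            for k in range(j + 1, i + 1):
                W[i][k] -= C[i][j] * C[k][j]
    return C

# Hypothetical positive definite test matrix.
V = [[4, 12, -16], [12, 37, -43], [-16, -43, 98]]
C = cholesky(V)
print(C)
```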
Quadratic Programming 

The quadratic programming problem with equality constraints is the minimization of the 
objective function 

$$
f(y) = (1/2) \, y'Wy + c'y , \quad W = W' ,
$$

with the constraints 

$$
g_i(y) = N_i' y = d_i .
$$

The gradients of $f$ and $g_i$ are given by 

$$
\nabla_y f = y'W + c' , \quad \text{and} \quad \nabla_y g_i = N_i' .
$$

The Lagrange (first order) optimality conditions state that the constraints are in effect, 
and that the objective function gradient equals a linear combination of the gradients of the 
constraint functions. Hence, the solution may be obtained from the Lagrange multipliers, i.e., 
the vector $l$ with the coefficients of the aforementioned linear combination, 

$$
N'y = d \wedge y'W + c' = l'N' ,
$$

or, in matrix form, 

$$
\begin{bmatrix} N' & 0 \\ W & -N \end{bmatrix} \begin{bmatrix} y \\ l \end{bmatrix} = \begin{bmatrix} d \\ -c \end{bmatrix} .
$$

These equations are known as the normal system. Reordering the block rows and taking $-l$ 
as the unknown gives a symmetric coefficient matrix, $\begin{bmatrix} W & N \\ N' & 0 \end{bmatrix}$. If the 
quadratic form $W$ is positive definite, i.e. if $\forall x$, $x'Wx \geq 0 \wedge x'Wx = 0 \Leftrightarrow x = 0$, and 
the constraint matrix is full rank, the coefficient matrix of the normal system is 
non-singular, so the system has a unique solution. 



SVD Factorization 



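The SVD factorization takes a real matrix $A$, $m \times n$, $m \geq n$, to a diagonal matrix, $D$, by 
left and right multiplication by orthogonal matrices, $D = U'AV$. Let us first consider the 
case $m = n$, i.e. a square matrix. 

The Jacobi algorithm is an iterative procedure that, at each iteration, "concentrates" 
the matrix in the diagonal by a Jacobi rotation, 

$$
J\{i, j, \theta_r\}' A\{k\} J\{i, j, \theta_l\} = A\{k+1\} =
\begin{bmatrix}
A\{k+1\}_1^1 & \cdots & A\{k+1\}_1^i & \cdots & A\{k+1\}_1^j & \cdots & A\{k+1\}_1^n \\
\vdots & \ddots & \vdots & & \vdots & & \vdots \\
A\{k+1\}_i^1 & \cdots & A\{k+1\}_i^i & \cdots & 0 & \cdots & A\{k+1\}_i^n \\
\vdots & & \vdots & \ddots & \vdots & & \vdots \\
A\{k+1\}_j^1 & \cdots & 0 & \cdots & A\{k+1\}_j^j & \cdots & A\{k+1\}_j^n \\
\vdots & & \vdots & & \vdots & \ddots & \vdots \\
A\{k+1\}_n^1 & \cdots & A\{k+1\}_n^i & \cdots & A\{k+1\}_n^j & \cdots & A\{k+1\}_n^n
\end{bmatrix} .
$$

Let us consider the sum of squares of off-diagonal elements of $A$, $\mathrm{Off2}(A)$. We can see 
that 

$$
\mathrm{Off2}(A\{k+1\}) = \mathrm{Off2}(A\{k\}) - (A\{k\}_i^j)^2 - (A\{k\}_j^i)^2 .
$$

Hence, choosing at each iteration the index pair that maximizes the sum of squares of the 
corresponding elements, the algorithm converges linearly to a diagonal matrix. 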

The Jacobi algorithm gives a constructive proof for the existence of the SVD factor- 
ization, and is the basis of several efficient numerical algorithms. 

If $A$ is a rectangular matrix, one can first find its QR factorization, and then apply 
the Jacobi algorithm to the upper triangular $R$ factor. If $A$ is square and symmetric, the 
obtained factorization is known as the eigenvalue decomposition of $A$. 

The orthogonal matrices $U$ and $V$ can be interpreted as orthonormal bases in the 
respective $m$ and $n$ dimensional spaces. The values at the diagonal of $D$ are called the 
singular values of matrix $A$, and can be interpreted geometrically as the scaling factors of 
the map $A = UDV'$, taking each unit vector of the basis $V$ to a scaled unit vector of the basis $U$. 
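
For the square symmetric case, the Jacobi iteration can be sketched with a single angle per rotation, $\theta = \frac{1}{2} \arctan(2q / (r - p))$, where $p$, $q$, $r$ are the entries of the selected $2 \times 2$ symmetric block. The sketch below is an illustration under stated assumptions, not the author's implementation: the name `jacobi_eigenvalues`, the tolerance, the rotation limit and the use of `atan2` (to cover the case $r = p$) are choices made here.

```python
import math

def jacobi_eigenvalues(S, tol=1e-12, max_rotations=100):
    """Eigenvalues of a symmetric matrix S (n >= 2) by classical Jacobi rotations."""
    n = len(S)
    A = [[float(v) for v in row] for row in S]
    for _ in range(max_rotations):
        # Choose the off-diagonal pair (i, j) with the largest absolute value.
        i, j = max(((r, c) for r in range(n) for c in range(r + 1, n)),
                   key=lambda rc: abs(A[rc[0]][rc[1]]))
        if abs(A[i][j]) < tol:
            break
        # Angle that zeroes A[i][j]: tan(2 theta) = 2q / (r - p).
        theta = 0.5 * math.atan2(2.0 * A[i][j], A[j][j] - A[i][i])
        c, s = math.cos(theta), math.sin(theta)
        for k in range(n):               # A <- A G, acting on columns i and j
            aki, akj = A[k][i], A[k][j]
            A[k][i] = c * aki - s * akj
            A[k][j] = s * aki + c * akj
        for k in range(n):               # A <- G' A, acting on rows i and j
            aik, ajk = A[i][k], A[j][k]
            A[i][k] = c * aik - s * ajk
            A[j][k] = s * aik + c * ajk
    return sorted(A[k][k] for k in range(n))

print(jacobi_eigenvalues([[2, 1], [1, 2]]))
```

Each rotation is an orthogonal similarity, so the trace and the Frobenius norm of the matrix are preserved while the off-diagonal mass decreases.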



Complex Matrices 



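Many techniques developed in this section for real matrices can be generalized to complex 
matrices. Practical and elegant methods of obtaining and describing such generalizations 
are described by Hemkumar (1994) using Cordic transforms (Coordinate Rotation 
Digital Computer). Such a transform is applied to a $2 \times 2$ complex matrix $M$ in the form 
of internal and external rotation pairs, 

$$
\begin{bmatrix} c(\phi) & -s(\phi) \\ s(\phi) & c(\phi) \end{bmatrix}
\begin{bmatrix} e(i\alpha) & 0 \\ 0 & e(i\beta) \end{bmatrix}
\begin{bmatrix} A e(ia) & B e(ib) \\ C e(ic) & D e(id) \end{bmatrix}
\begin{bmatrix} e(i\gamma) & 0 \\ 0 & e(i\delta) \end{bmatrix}
\begin{bmatrix} c(\psi) & -s(\psi) \\ s(\psi) & c(\psi) \end{bmatrix} .
$$

The elegance of these Cordic transforms comes from the following observations: The 
internal transform affects only the imaginary exponents of the matrix elements, while the 
external transform can be independently applied to the real and the imaginary parts of 
the matrix, that is, 

$$
\begin{bmatrix} e(i\alpha) & 0 \\ 0 & e(i\beta) \end{bmatrix}
\begin{bmatrix} A e(ia) & B e(ib) \\ C e(ic) & D e(id) \end{bmatrix}
\begin{bmatrix} e(i\gamma) & 0 \\ 0 & e(i\delta) \end{bmatrix}
= \begin{bmatrix} A e(ia') & B e(ib') \\ C e(ic') & D e(id') \end{bmatrix}
= \begin{bmatrix} A e(i(a + \alpha + \gamma)) & B e(i(b + \alpha + \delta)) \\ C e(i(c + \beta + \gamma)) & D e(i(d + \beta + \delta)) \end{bmatrix} ,
$$

$$
\begin{bmatrix} c(\phi) & -s(\phi) \\ s(\phi) & c(\phi) \end{bmatrix}
\begin{bmatrix} A_r + iA_i & B_r + iB_i \\ C_r + iC_i & D_r + iD_i \end{bmatrix}
\begin{bmatrix} c(\psi) & -s(\psi) \\ s(\psi) & c(\psi) \end{bmatrix}
= \begin{bmatrix} c(\phi) & -s(\phi) \\ s(\phi) & c(\phi) \end{bmatrix}
\begin{bmatrix} A_r & B_r \\ C_r & D_r \end{bmatrix}
\begin{bmatrix} c(\psi) & -s(\psi) \\ s(\psi) & c(\psi) \end{bmatrix}
+ i \begin{bmatrix} c(\phi) & -s(\phi) \\ s(\phi) & c(\phi) \end{bmatrix}
\begin{bmatrix} A_i & B_i \\ C_i & D_i \end{bmatrix}
\begin{bmatrix} c(\psi) & -s(\psi) \\ s(\psi) & c(\psi) \end{bmatrix} .
$$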



The following table defines some useful internal and external transforms. Type I 
transforms change the imaginary exponents of the matrix elements at one of the diagonals. 
Transforms of Type R, C and D make real the elements in a row, column or diagonal. 

Type         Value 
$I_{main}$   $\alpha = -\beta = \gamma = -\delta = (d - a)/2$ 
$I_{off}$    $\alpha = -\beta = -\gamma = \delta = (c - b)/2$ 
$R_{up}$     $\alpha = \beta = -(b + a)/2$ ; $\gamma = -\delta = (b - a)/2$ 
$R_{low}$    $\alpha = \beta = -(d + c)/2$ ; $\gamma = -\delta = (d - c)/2$ 
$C_{left}$   $\alpha = -\beta = (c - a)/2$ ; $\gamma = \delta = -(c + a)/2$ 
$C_{right}$  $\alpha = -\beta = (d - b)/2$ ; $\gamma = \delta = -(d + b)/2$ 
$D_{main}$   $\alpha = \beta = -(d + a)/2$ ; $\gamma = -\delta = (d - a)/2$ 
$D_{off}$    $\alpha = \beta = -(b + c)/2$ ; $\gamma = -\delta = (b - c)/2$ 



It is easy to see that a sequence of internal transforms is equivalent to a single internal 
transform whose parameters are the sum of the corresponding parameters of the transforms 
in the sequence. 
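
As a small numerical illustration, the sketch below applies an internal transform to a $2 \times 2$ complex matrix and checks that the $R_{up}$ parameters, $\alpha = \beta = -(b + a)/2$ and $\gamma = -\delta = (b - a)/2$, make the upper row real while leaving all moduli unchanged. The function names `internal` and `r_up` and the test matrix are assumptions made for this example.

```python
import cmath

def internal(M, alpha, beta, gamma, delta):
    """Internal transform: diag(e^{i alpha}, e^{i beta}) M diag(e^{i gamma}, e^{i delta})."""
    left = (cmath.exp(1j * alpha), cmath.exp(1j * beta))
    right = (cmath.exp(1j * gamma), cmath.exp(1j * delta))
    return [[left[r] * M[r][c] * right[c] for c in range(2)] for r in range(2)]

def r_up(M):
    """Type R_up transform: make the elements of the upper row real."""
    a = cmath.phase(M[0][0])    # imaginary exponent of the (1, 1) element
    b = cmath.phase(M[0][1])    # imaginary exponent of the (1, 2) element
    return internal(M, -(b + a) / 2, -(b + a) / 2, (b - a) / 2, -(b - a) / 2)

# Hypothetical complex test matrix.
M = [[1 + 1j, 2 - 1j], [3j, -1 + 2j]]
N = r_up(M)
print(N[0])
```

Only the phases change: each element of `N` has the same modulus as the corresponding element of `M`.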

Combining internal and external transforms, it is possible to create HT's for several 
interesting algorithms. For example, the HT's of type I, II and III in the following table 
can be used to obtain the SVD factorization of a complex matrix, much like the Jacobi 
algorithm. A type I transform applies $R_{low}$ followed by a rotation, making the matrix 
upper triangular. A type II transform applies $D_{main}$, $I_{off}$ followed by a diagonalization. For 
Hermitian (self-adjoint) matrices, the diagonalization is obtained using only one transform 
of type III. 

Type   Internal                                                              External 
I      $\alpha = \beta = -(d + c)/2$ ; $\gamma = -\delta = (d - c)/2$        $\phi = 0$ ; $\psi = \arctan(C/D)$ 
II     $\alpha = -(a + b)/2$ ; $\beta = \gamma = -\delta = (b - a)/2$        $\phi \pm \psi = \arctan(B/(D \mp A))$ 
III    $\alpha = -\beta = -\gamma = \delta = -b/2$                           $\phi = \psi = \arctan(2B/(D - A))/2$ 



Exercises 

1. Use the fundamental properties of the inner product to prove that: 

(a) The Cauchy-Schwarz inequality: $|\langle x \mid y \rangle| \leq \|x\| \, \|y\|$. Suggestion: Compute 
$\|x - \alpha y\|^2$ for $\alpha = \langle x \mid y \rangle / \|y\|^2$. 

(b) The triangular inequality: $\|x + y\| \leq \|x\| + \|y\|$. 

(c) In which case do we have equality or strict Cauchy-Schwarz inequality? Relate 
your answer to the definition of angle between two vectors. 

2. Use the definition of inner product in $\mathbb{R}^n$ to prove the parallelogram law: 
$\|x + y\|^2 + \|x - y\|^2 = 2\|x\|^2 + 2\|y\|^2$. 

3. A matrix is idempotent, or a non-orthogonal projector, iff $P^2 = P$. Prove that: 

(a) $R = (I - P)$ is idempotent. 

(b) $\mathbb{R}^n = C(P) + C(R)$. 

(c) All eigenvalues of $P$ are $0$ or $+1$. Suggestion: Show that if $\lambda = 0$ is a root of the 
characteristic polynomial of $P$, $\varphi_P(\lambda) = \det(P - \lambda I)$, then $(1 - \lambda) = 1$ is a 
root of $\varphi_R(\lambda)$. 

4. Prove that $\forall P$ idempotent and symmetric, $P = P_{C(P)}$. Suggestion: Show that 
$P'(I - P) = 0$. 

5. Prove that the projection operator into a given vector subspace, $V$, $P_V$, is unique 
and symmetric. 

6. Prove Pythagoras theorem: $\forall b \in \mathbb{R}^m$, $u \in V$ we have $\|b - u\|^2 = \|b - P_V b\|^2 + 
\|P_V b - u\|^2$. 

7. Assume we have the QR factorization of a matrix $A$. Consider a new matrix, $\tilde{A}$, 
obtained from $A$ by the substitution of a single column. How could we update 
our orthogonal factorization using only $3n$ rotations? Suggestion: (a) Remove the 
altered column of $A$ and update the factorization using at most $n$ rotations. (b) 
Rotate the new column by the current orthogonal factor, $\tilde{a} = Q'a = R^{-t}A'a$. 
(c) Add $\tilde{a}$ as the last column of $\tilde{A}$, and update the factorization using $2n$ rotations. 

8. Compute the LDL and Cholesky factorizations of matrix 

$$
\begin{bmatrix} 4 & 12 & 8 & 12 \\ 12 & 37 & 29 & 38 \\ 8 & 29 & 45 & 50 \\ 12 & 38 & 50 & 113 \end{bmatrix} .
$$

9. Prove that: 

(a) $(AB)' = B'A'$. 

(b) $(AB)^{-1} = B^{-1}A^{-1}$. 

(c) $A^{-t} = (A^{-1})' = (A')^{-1}$. 

10. Describe four algorithms to compute $L^{-1}x$ and $L^{-t}x$, accessing the unit diagonal 
and lower triangular matrix $L$ row by row or column by column. 




F.3 Sparse Factorizations 

As indicated in chapter 4, we present in this appendix some aspects related to the sparse 
factorization. This material has strong connections with the issues discussed in chapter 
4, but is more mathematical in its nature, and can be omitted by the reader interested 
mostly in the purely epistemological aspects of decoupling. 

Computing the Cholesky factorization of an $n \times n$ matrix involves on the order of $n^3$ 
arithmetical operations. Large models may have thousands of variables, so it seems 
that decoupling large models requires a lot of work. Nevertheless, in practice, matrices 
appearing in large models are typically sparse and structured. A matrix is called sparse if 
it has many zero elements, otherwise it is called dense. A sparse matrix is called structured 
if its non-zero-elements (NZEs) are arranged in a "nice" pattern. As we will see in the 
next sections, we may be able to obtain a Cholesky factor, $L$, of a (permuted) sparse and 
structured matrix $V$, that 'preserves' some of its sparsity and structure, hence decreasing 
the computational work. 

F.3.1 Sparsity and Graphs 

In the discussion of sparsity and structure, the language of graph theory is very helpful. 
This section gives a quick review of some of the basic concepts on directed and undirected 
graphs, and also defines the process of vertex elimination. 

A Directed Graph, or DG, $\mathcal{G} = (\mathcal{V}, \mathcal{A})$ has a set of vertices or nodes, $\mathcal{V}$, indexed by 
natural numbers, and a set of directed arcs, $\mathcal{A}$, where each arc joins two vertices. We say 
that arc $(i, j) \in \mathcal{A}$ goes from node $i$ to node $j$. When drawing a graphical representation 
of a DG, it is usual to represent vertices by dots, and arcs by arrows between the dots. 
In a DG, we say that $i$ is a parent of $j$, $i \in pa(j)$, or that $j$ is a child of $i$, $j \in ch(i)$, if 
there is an arc going from $i$ to $j$. The children of $i$, the children of its children, and so on, 
are the descendants of $i$. If $j$ is a descendant of $i$ we say that there is a path in $\mathcal{G}$ going 
from $i$ to $j$. A cycle is a path from a given vertex to itself. An arc from a vertex to 
itself, $(j, j)$, is called a loop. In some situations we spare the effort of multiple definitions 
of essentially the same objects by referring to the same graph with or without all possible 
loops. 

There is yet another representation for a DG, $\mathcal{G}$, given by $(\mathcal{V}, B)$, where the adjacency 
matrix, $B$, is the Boolean matrix $B(i, j) = 1$ if arc $(i, j) \in \mathcal{A}$, and $0$ otherwise. The 
key element relating the topics presented in this and the previous section is the Boolean 
matrix $B$ indicating the non-zero elements of the numerical matrix $A$, $B_i^j = I(A_i^j \neq 0)$. 
In this way, the graph $\mathcal{G} = (\mathcal{V}, B)$ is used to represent the sparsity pattern of a numerical 
matrix $A$. 

A Directed Acyclic Graph, DAG, has no cycles. A separator $S \subset \mathcal{V}$ separates $i$ from 
$j$ if any path from $i$ to $j$ goes through a vertex in $S$. A vertex $j$ is a spouse of vertex $i$, 
$j \in f(i)$, if they have a child in common. A tree is a DAG where each vertex has exactly 
one parent, except for the root vertex, that has no parent. The leaves of a tree are the 
vertices with no children. A graph composed by several trees is a forest. 

An Undirected Graph, or UG, is a DG where, if arc $(i, j)$ is in the graph, so is its 
opposite, $(j, i)$. A UG can also be represented as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where each undirected 
edge, $\{i, j\} \in \mathcal{E}$, stands for the pair of opposite directed arcs, $(i, j)$ and $(j, i)$. Obviously, 
the adjacency matrix of a UG is a symmetric matrix, and vice-versa. 

Figure 2: A DAG and its Moral Graph. 



The moral graph of the DAG $\mathcal{G}$, $\mathcal{M}(\mathcal{G})$, is the undirected graph with the same nodes 
as $\mathcal{G}$, and edges joining nodes $i$ and $j$ if they are immediate relatives in $\mathcal{G}$. The immediate 
relatives of a node in $\mathcal{G}$ include its parents, children and spouses (but not brothers or 
sisters). The set of immediate relatives of $i$ is also called the Markov blanket of $i$, $m(i)$, 
hence, $j \in m(i)$ if $j$ is a neighbor of $i$ in the moral graph. Figure 2 represents a DAG, its 
moral graph, and the Markov blanket of one of its vertices. 

Sometimes it is important to consider an order on the vertex set, established by an 
'index vector' $q$, in (a subset of) $\mathcal{V} = \{1, 2, \ldots N\}$. For example, we can consider the natural 
order $q = [1, 2, \ldots N]$, or the order given by a permutation, $q = [q(1), q(2), \ldots q(N)]$. 

In order not to make language and notation too heavy, we may refer to the vertex 
'set' $q$, meaning the set of elements in vector $q$. Also, given two index vectors, $a = 
[a(1), \ldots a(A)]$ and $b = [b(1), \ldots b(B)]$, the index vector $c = a \cup b$ has all the indices in 
$a$ or $b$. Similarly, $c = a \backslash b$ has all the indices in $a$ that are not in $b$. These are essentially 
set operations but, since an index vector also establishes an order of its elements, $c = 
[c(1), \ldots c(C)]$, this order, if not otherwise indicated, has somehow to be chosen. 

We define the elimination process in the UG, $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, $\mathcal{V} = \{1, \ldots N\}$, given an 
elimination order, $q = [q(1), \ldots q(N)]$, as the sequence of elimination graphs $\mathcal{G}_k = (\mathcal{V}_k, \mathcal{E}_k)$ 
where, for $k = 1 \ldots N$, 

$$
\mathcal{V}_k = \{ q(k), q(k+1), \ldots q(N) \} , \quad \mathcal{E}_1 = \mathcal{E} , \quad \text{and, for } k > 1 ,
$$

$$
\{i, j\} \in \mathcal{E}_k \Leftrightarrow
\begin{cases}
\{i, j\} \in \mathcal{E}_{k-1} , \text{ or} \\
\{q(k-1), i\} \in \mathcal{E}_{k-1} \text{ and } \{q(k-1), j\} \in \mathcal{E}_{k-1} ,
\end{cases}
$$

that is, when eliminating vertex $q(k)$, we make its neighbors a clique, adding all missing 
edges between them. 

The filled graph is the graph $(\mathcal{V}, \mathcal{F})$, where $\mathcal{F} = \cup_{k=1}^{N} \mathcal{E}_k$. The original edges and the 
filled edges in $\mathcal{F}$ are, respectively, the edges in $\mathcal{E}$ and in $\mathcal{F} \backslash \mathcal{E}$. 

Figure 3 shows a graph with 6 vertices, the elimination graphs, and the filled graph, 
considering the elimination order $q = [1, 3, 6, 2, 4, 5]$. 



Figure 3: Elimination Graphs. 
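
The elimination process can be sketched in a few lines of Python. Since the edge set of the graph in Figure 3 is not recoverable from this text, the example uses a hypothetical star graph (vertex 1 joined to vertices 2, 3 and 4) to show how strongly the fill depends on the elimination order. The function name `eliminate` is an assumption made for this sketch.

```python
def eliminate(edges, order):
    """Vertex elimination on an undirected graph.

    Returns (filled, fill): the edge set of the filled graph, as frozensets,
    and the list of fill edges in the order in which they are created.
    """
    adj = {v: set() for v in order}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    filled = {frozenset(e) for e in edges}
    fill = []
    for v in order:
        nbrs = sorted(adj.pop(v))        # neighbors of v still in the graph
        for u in nbrs:
            adj[u].discard(v)            # remove v from the graph
        for k, a in enumerate(nbrs):     # make the neighbors of v a clique
            for b in nbrs[k + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    filled.add(frozenset((a, b)))
                    fill.append((a, b))
    return filled, fill

star = [(1, 2), (1, 3), (1, 4)]
_, fill_center_first = eliminate(star, [1, 2, 3, 4])
_, fill_center_last = eliminate(star, [2, 3, 4, 1])
print(fill_center_first, fill_center_last)
```

Eliminating the center first turns the three leaves into a clique, while eliminating it last creates no fill at all.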



There is a computationally more efficient form of obtaining the filled graph, known 
as simplified elimination: In the simplified version of the elimination graphs, $\mathcal{G}_k^*$, when 
eliminating vertex $q(k)$, we add only the clique edges incident to its neighbor, $q(l)$, that 
is next in the elimination order. Figure 4 shows the simplified elimination graphs and 
the filled graph corresponding to the elimination process in Figure 3; the vertex being 
eliminated is in boldface, and its next (in the elimination order) neighbor in italic. 

Figure 4: Simplified Elimination Graphs. 



An elimination order is perfect if it generates no fill. Perfect elimination is the key 
to relate the vertex elimination process to the theory of chordal graphs, see Stern (1994). 
Chordal graph theory provides a unified framework for similar elimination processes in 
several other contexts, see Golumbic (1980), Stern (1994) and Lauritzen (2006). 
Nevertheless, we will not explore this connection any further in this paper. 

The material presented in this section will be used in the next two sections for the 
analysis of the sparsity structure in Cholesky factorization and Bayesian networks. This 
structure is the key for efficient decoupling, allowing the computation of large models, used 
in the analysis of large systems. These structural aspects have been an area of intense 
research by the designers of efficient numerical algorithms. However, the same area has 
not been able to attract so much interest in statistical modeling. From the epistemological 
considerations in the following chapters, we hope to convince the reader that this is a topic 
that deserves to receive much more attention from the statistical modeler. 



F.3.2 Sparse Cholesky Factorization 

Let us begin with some matrix notation. Given a matrix $A$, and index vectors $p$ and 
$q$, the equivalent notations $A(p, q)$ or $A_p^q$ indicate the (sub) matrix of rows and columns 
extracted from $A$ according to the indices in $p$ and $q$. In particular, if $p$ and $q$ have single 
indices, $i$ and $j$, $A(i, j)$ or $A_i^j$ indicate the element of $A$ in row $i$ and column $j$. The next 
example shows a more general case: 

$$
p = \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} , \quad
q = \begin{bmatrix} 3 & 2 \end{bmatrix} , \quad
A = \begin{bmatrix} 11 & 12 & 13 \\ 21 & 22 & 23 \\ 31 & 32 & 33 \end{bmatrix} , \quad
A_p^q = \begin{bmatrix} 23 & 22 \\ 33 & 32 \\ 13 & 12 \end{bmatrix} .
$$

If $q = [q(1), \ldots q(N)]$ is a permutation of $[1, \ldots N]$, and $I$ is the identity matrix, $Q = I_q$ 
and $Q' = I^q$ are the corresponding row and column permutation matrices. Moreover, if 
$A$ is an $N \times N$ matrix, $A_q = QA$ and $A^q = AQ'$. The symmetric permutation of $A$ in order 
$q$ is $A(q, q) = QAQ'$. 

Let us consider the covariance structure model of section 3. If we write the variables 
of the model in a permuted order, $q$, the new covariance matrix is $V(q, q)$. The statistical 
model is of course the same, but the Cholesky factors of the two matrices may have quite 
a different sparsity structure. 

Figure 5 shows the positions filled in the Cholesky factorization of a matrix $A$, and 
in the Cholesky factorization of two symmetric permutations of the same matrix, $A(q, q)$. 
Initial Non Zero Elements, NZEs, are represented by x, initial zeros filled during the 
factorization are represented by 0, and initial zeros left unfilled are represented by blank 
spaces. 

Figure 5: Filled Positions in Cholesky Factorization. 



The next lemma connects the numerical elimination process in the Cholesky 
factorization of a symmetric matrix $A$, to the vertex elimination process in the UG having as 
adjacency matrix, $B$, the sparsity pattern of $A$. 

Elimination Lemma: When eliminating the $j$-th column in the Cholesky factorization 
of matrix $A(q, q) = LL'$, we fill the positions in $L$ corresponding to the filled edges in $\mathcal{F}$ 
at the elimination of vertex $q(j)$. 

Given a matrix $A$, $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, an elimination order $q$, and the respective filled graph, 
let us consider the set of row indices of NZE's in $L^j$, the $j$-th column of the Cholesky 
factor, $L \mid QAQ' = LL'$: 

$$
nze(L^j) = \{ i \mid i > j \wedge \{q(i), q(j)\} \in \mathcal{F} \} + \{j\} .
$$



Figure 6: Elimination Trees. 

We define the elimination tree, $\mathcal{H}$, by 

$$
h(j) = \begin{cases}
j , & \text{if } nze(L^j) = \{j\} , \text{ or} \\
\min\{ i > j \mid i \in nze(L^j) \} , & \text{otherwise} ,
\end{cases}
$$

where $h(j)$, the parent of $j$ in $\mathcal{H}$, is the first (non diagonal) NZE in column $j$ of $L$. Figure 
6 shows the elimination trees corresponding to the examples in Figure 5. 
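
The definition of $h(j)$ translates directly into a symbolic computation: eliminate the vertices in order, take as parent of $j$ the smallest remaining neighbor, and add the fill edges among the remaining neighbors. The sketch below assumes the natural elimination order $q = [1, \ldots N]$ (for another order, relabel the vertices first); the function name `elimination_tree` and the two example patterns are assumptions made here.

```python
def elimination_tree(edges, n):
    """Parent vector h of the elimination tree, for the natural order 1..n.

    edges lists the off-diagonal NZE positions {i, j} of a symmetric matrix.
    Returns a dict with h[j] = j for roots.
    """
    adj = {v: set() for v in range(1, n + 1)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    h = {}
    for j in range(1, n + 1):
        # Row indices of the NZEs below the diagonal in column j of L.
        below = sorted(i for i in adj[j] if i > j)
        h[j] = below[0] if below else j
        # Eliminating j makes its remaining neighbors a clique (fill).
        for k, a in enumerate(below):
            for b in below[k + 1:]:
                adj[a].add(b)
                adj[b].add(a)
    return h

# Hypothetical 6 x 6 sparsity pattern with no fill in the natural order.
pattern = [(1, 4), (2, 5), (3, 5), (3, 6), (4, 6), (5, 6)]
print(elimination_tree(pattern, 6))
# A star centered at vertex 1 fills completely, and its tree is a chain.
print(elimination_tree([(1, 2), (1, 3), (1, 4)], 4))
```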

Elimination Tree Theorem: For any row index $i$ below the diagonal in column $j$ 
of $L$, $j$ is a descendant of $i$ in the elimination tree, that is, for any $i > j \mid i \in nze(L^j)$, there 
is a path in $\mathcal{H}$ going from $i$ to $j$. 

Proof (see Figure 7): If $i = h(j)$, the result is trivial. Otherwise, let 
$k = h(j)$. But $L_i^j \neq 0 \wedge L_k^j \neq 0 \Rightarrow L_i^k \neq 0$, because $\{q(j), q(i)\}, \{q(j), q(k)\} \in \mathcal{E}_j 
\Rightarrow \{q(k), q(i)\} \in \mathcal{E}_{j+1}$. Now, either $i = h(k)$, or, applying the argument recursively, we trace 
a branch of $\mathcal{H}$, $(i, l, \ldots k, j)$, $i > l > \ldots > k > j$. QED. 

Figure 7: A Branch in the Elimination Tree. 



From the proof of the last theorem we see that the elimination tree portrays the 
dependencies among the columns for the numeric factorization process. More exactly, we 
can eliminate column $j$ of $A$, i.e. compute all the multipliers in column $j$, $M^j$, and update 
all the elements affected by these multipliers, if and only if we have already eliminated all 
the descendants of $j$ in the elimination tree. 

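If we are able to perform parallel computations, we can simultaneously eliminate all the 
columns at a given level of the elimination tree, beginning with the leaves, and finishing at 
the root. Example 4 considers the elimination of a matrix with the same sparsity pattern 
of the last permutation in example 1. Its elimination tree is the last one presented at 
Figure 6. This elimination tree has three levels that, from the leaves to the root, are: 
$\{1, 3, 2\}$, $\{4, 5\}$, and $\{6\}$. 

Hence, we can perform a Cholesky factorization with this sparsity pattern in only 2 
steps, as illustrated in the following numerical example: 

$$
\begin{bmatrix}
1 & & & 7 & & \\
& 2 & & & 8 & \\
& & 3 & & 6 & 9 \\
7 & & & 53 & & 2 \\
& 8 & 6 & & 49 & 23 \\
& & 9 & 2 & 23 & 39
\end{bmatrix}
\rightarrow
\begin{bmatrix}
1 & & & 7 & & \\
& 2 & & & 8 & \\
& & 3 & & 6 & 9 \\
\mathit{7} & & & 4 & & 2 \\
& \mathit{4} & \mathit{2} & & 5 & 5 \\
& & \mathit{3} & 2 & 5 & 12
\end{bmatrix}
\rightarrow
\begin{bmatrix}
1 & & & 7 & & \\
& 2 & & & 8 & \\
& & 3 & & 6 & 9 \\
\mathit{7} & & & 4 & & 2 \\
& \mathit{4} & \mathit{2} & & 5 & 5 \\
& & \mathit{3} & \mathit{1/2} & \mathit{1} & 6
\end{bmatrix}
$$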

The sparse matrix literature has many heuristics designed for finding good elimination 
orders. The example in Figures 8 and 9 shows a good elimination order for a $13 \times 13$ sparse 
matrix. 

Figure 8: Gibbs Heuristic's Elimination Order. 



The elimination order in Figure 8 was found using the Gibbs heuristic, described in
Stern (1994, ch.6) or Pissanetzky (1984, ch.x). The intuitive idea of the Gibbs heuristic, see
Figure 9, is as follows:
(1) Start from a 'peripheral' vertex, in our example, vertex 3.
(2) Grow a breadth-first tree T in G. Notice that the vertices at a given level, l, of
T form a separator, $S_l$, in the graph G.
(3) Choose a separator, $S_l$, that is 'small', i.e. with few vertices, and 'central', i.e.
dividing G in 'balanced' components.
(4) Place in q, first the indices of each component separated by $S_l$, and, at last, the
vertices in $S_l$.
(5) Proceed recursively, separating each large component into smaller ones.
In our example, we first use separator $S_5$ = {4,5}, dividing G in three components,
$C_1$ = {3,8,1,10,9}, $C_2$ = {12,13,2,7,6}, and $C_3$ = {11}. Next, we use separators
$S_3$ = {9} in $C_1$, and $S_7$ = {6} in $C_2$.
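
One level of this dissection can be sketched in a few lines of Python. This is only an illustration on a hypothetical 2 x 4 grid graph (not the 13-vertex graph of Figures 8 and 9), and the function names are ours. It grows the breadth-first tree, takes a middle level as separator, and orders the components first and the separator last. A full implementation would recurse on the components and use a better rule for choosing the separator.

```python
from collections import deque

# Hypothetical 2 x 4 grid graph: vertices 1-4 on top, 5-8 below.
ADJ = {
    1: [2, 5], 2: [1, 3, 6], 3: [2, 4, 7], 4: [3, 8],
    5: [1, 6], 6: [2, 5, 7], 7: [3, 6, 8], 8: [4, 7],
}


def bfs_levels(adj, root):
    """Vertices of the breadth-first tree rooted at `root`, level by level."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in level:
                level[w] = level[v] + 1
                queue.append(w)
    depth = max(level.values())
    return [sorted(v for v in level if level[v] == l) for l in range(depth + 1)]


def components_without(adj, separator):
    """Connected components left after removing the separator vertices."""
    seen = set(separator)
    comps = []
    for s in sorted(adj):
        if s in seen:
            continue
        comp, stack = [], [s]
        seen.add(s)
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(sorted(comp))
    return comps


levels = bfs_levels(ADJ, 1)             # tree grown from a corner vertex
separator = levels[len(levels) // 2]    # a middle level: small and central
parts = components_without(ADJ, separator)
order = [v for part in parts for v in part] + separator
```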

The main goal of the techniques studied in this and the last section is to find an
elimination order filling as few positions as possible in the Cholesky factor. Once the
elimination order has been chosen, simplified elimination can be used to prepare in advance
all the data structures holding the sparse matrices, hence separating the symbolic
(combinatorial) and numerical steps of the factorization. This separation is important in
the production of high performance computer programs.

Figure 9: Nested Dissection by Gibbs Heuristic.



F.4 Bayesian Networks 

The objective of this section is to show that the sparsity techniques described in the last
two sections can be applied, almost immediately, to another important statistical model,
namely, Bayesian networks. The presentation in this section follows very closely Cozman
(2000). A Bayesian network is represented by a DAG. Each node, i, represents a random
variable, $x_i$. Using the notation established in section 9, we write $i \in n$, where n is the
index vector n = [1, 2, ..., N]. The DAG representing the Bayesian network has an arc
from node i to node j if the probability distribution of variable $x_j$ is directly dependent
on variable $x_i$.

In many statistical models that arc is interpreted as a direct influence or causal effect
of $x_i$ on $x_j$. Technically, we assume that the joint distribution of the vector x is given in
the following product form,

$$ p(x) = \prod_{j \in n} p\left(x_j \mid x_{pa(j)}\right) , $$

where pa(j) denotes the indices of the parents of node j, that is, the nodes with an arc
pointing to j.

The important property of Markov blankets in a Bayesian network is that, given the
variables in its Markov blanket, mb(i), a variable $x_i$ is conditionally independent of any
other variable, $x_j$, in the network, that is, the Markov blanket of a variable 'decouples' this
variable from the rest of the network,

$$ p\left(x_i \mid x_{mb(i)}, x_j\right) = p\left(x_i \mid x_{mb(i)}\right) . $$

Inference in Bayesian networks is based on queries, where the distribution of some
'query' variables, $x_q$, q = [q(1), ..., q(Q)], is computed, given the observed values of some
'evidence' variables, $x_e$, e = [e(1), ..., e(E)]. Such queries are performed by eliminating, that
is, marginalizing, integrating, or summing out, all the remaining variables, $x_s$, that is.

We place the indices of the variables to be eliminated in the elimination index vector,
$s = r \setminus (q \cup e)$. For now, let us consider the 'requisite' index vector, r, as being just a
permutation (reordering) of the original indices in the network, that is, r = [r(1), ..., r(R)],
R = N. The 'elimination order' or 'elimination sequence', s = [s(1), ..., s(S)], will play
an important role in what follows.

Let us mention two technical points: First, not all variables of the original network 
may be needed for a given query. If so, the indices of the unnecessary ones can be removed 
from the requisite index vector, and the query is performed involving only a proper subset 
of the original variables, hence, R < N. For example, if the network has disconnected 
components, all the vertices in components having no query variables are unnecessary. 
Second, the normalization constants of distributions that appear in intermediate
computations are costly to obtain and, more importantly, not needed. Hence, we can perform
these intermediate computations with un-normalized distributions, also called 'potentials'.

Making explicit use of the elimination order, s = [s(1), ..., s(S)], we can write the last
equation as

$$ p(x_q \mid x_e) = \sum_{x_{s(S)}} \cdots \sum_{x_{s(1)}} p\left(x_{r(1)} \mid x_{pa(r(1))}\right) \times \cdots \times p\left(x_{r(R)} \mid x_{pa(r(R))}\right) . $$

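Because $x_{s(1)}$ can only appear in densities $p(x_j \mid x_{pa(j)})$ for j = s(1) or $j \in ch(s(1))$, we
can separate the first summation, writing

$$ p(x_q \mid x_e) = \sum_{x_{s(S)}} \cdots \sum_{x_{s(2)}} \; \prod_{j \in r \setminus (ch(s(1)) \cup s(1))} p\left(x_j \mid x_{pa(j)}\right) \times \left( \sum_{x_{s(1)}} \; \prod_{j \in ch(s(1)) \cup s(1)} p\left(x_j \mid x_{pa(j)}\right) \right) . $$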

Eliminating, i.e. integrating out, the first variable in the elimination order, $x_{s(1)}$, we
create a new (joint) potential of the children of the eliminated variable, given its parents
and spouses, that is,

$$ p\left(x_{ch(s(1))} \mid x_{pa(s(1))}, x_{sp(s(1))}\right) = \sum_{x_{s(1)}} \; \prod_{j \in ch(s(1)) \cup s(1)} p\left(x_j \mid x_{pa(j)}\right) . $$

Next we eliminate $x_{s(2)}$, that is, we collect all potentials containing $x_{s(2)}$, form their
joint product, and marginalize on $x_{s(2)}$. We proceed in the elimination order, eliminating
$x_{s(3)}, x_{s(4)}, \dots, x_{s(S)}$, at which point the normalized potentials left give us the
distribution $p(x_q \mid x_e)$.

We refer to the variables appearing in a joint potential as that potential's cluster.
Forming a joint potential is a computation of a complexity that is exponential in the
size of its cluster. Hence, it is vital to choose an elimination order that keeps the cluster
sizes as small as possible. But the clusters formed in the elimination process of a BN 
correspond to the cliques appearing in the elimination graphs, as defined in the last two 
sections. Hence all techniques and heuristics used for finding a good elimination order 
for Cholesky factorization can be used to obtain a good elimination order for querying 
a BN. Also, all the abstract combinatorial structures appearing in sparse Cholesky fac- 
torization, like elimination trees, have their analogues for computation in BNs. Cozman 
(2000) develops the complete theory of BNs in a very simple and intuitive way, a way 
that naturally highlights this analogy. Other authors have already commented on the 
similarities between several graph decomposition algorithms, see for example Lauritzen 
(2006, Lecture 4, Probability propagation and related algorithms) for a very general and 
abstract, but highly mathematical overview. 
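
The elimination procedure of this section can be sketched in Python for binary variables. This is a minimal illustration, not Cozman's algorithm in full generality. It assumes a hypothetical three-node chain a -> b -> c with made-up probability tables, and represents each potential as a pair (variables, table). All names (`multiply`, `drop`, `query`) are ours.

```python
from itertools import product


def multiply(f, g):
    """Joint product of two potentials; its cluster is the union of variables."""
    fv, ft = f
    gv, gt = g
    vs = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for vals in product((0, 1), repeat=len(vs)):
        env = dict(zip(vs, vals))
        table[vals] = (ft[tuple(env[v] for v in fv)]
                       * gt[tuple(env[v] for v in gv)])
    return vs, table


def drop(f, var, keep=None):
    """Sum out `var` (keep=None), or fix it to the observed value `keep`."""
    fv, ft = f
    if var not in fv:
        return f
    idx = fv.index(var)
    table = {}
    for vals, p in ft.items():
        if keep is not None and vals[idx] != keep:
            continue
        key = vals[:idx] + vals[idx + 1:]
        table[key] = table.get(key, 0.0) + p
    return fv[:idx] + fv[idx + 1:], table


def query(potentials, q, evidence, order):
    """Distribution of x_q given the evidence, eliminating along `order`."""
    fs = [drop_all(f, evidence) for f in potentials]
    for s in order:
        cluster = [f for f in fs if s in f[0]]
        fs = [f for f in fs if s not in f[0]]
        if cluster:
            joint = cluster[0]
            for f in cluster[1:]:
                joint = multiply(joint, f)
            fs.append(drop(joint, s))        # marginalize on x_s
    result = fs[0]
    for f in fs[1:]:
        result = multiply(result, f)
    assert result[0] == (q,)
    z = sum(result[1].values())              # normalize only at the end
    return {vals[0]: p / z for vals, p in result[1].items()}


def drop_all(f, evidence):
    for var, val in evidence.items():
        f = drop(f, var, keep=val)
    return f


# Hypothetical chain a -> b -> c: p(a), p(b | a), p(c | b).
P_A = (("a",), {(0,): 0.7, (1,): 0.3})
P_B = (("a", "b"), {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.9})
P_C = (("b", "c"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})

posterior = query([P_A, P_B, P_C], "a", {"c": 1}, order=["b"])
```

The intermediate potentials are never normalized. Only the last one is, in agreement with the remark on potentials made above.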



Appendix G 

Monte Carlo Miscellanea 



Monte Carlo or, if necessary, Markov Chain Monte Carlo, is the basic tool we use for
numerical integration. There are several excellent books on the subject. Hammersley
and Handscomb (1964) is a short and intuitive introduction, including some important
topics not usually covered at this level, like pseudo-random and quasi-random generators,
importance sampling and other variance reduction techniques, and the solution of linear
systems. This book is now out of print, but has the advantage of being freely available for
download on the internet. Ripley (1987) is another excellent text covering this material
that is still in print. Gilks et al. (1996) gives several excellent and up-to-date review
papers on areas that are of interest for statistical modeling. There is a vast literature on
MC and MCMC written by physicists. It contains many original, interesting and useful
ideas, but sometimes it employs a terminology that is unfamiliar to statisticians. The
article of Meng and Wong (1996) can help to overcome this gap.

G.1 Pseudo, Quasi and Subjective Randomness

The implementation of Monte Carlo methods, as described in the following sections,
requires a generator of i.i.d. (independent and identically distributed)
random variables uniformly distributed in the unit interval, [0, 1[. From this basic uniform
generator one gets a uniform generator in the d-dimensional unit box, $[0,1[^d$, and, from
there, non-linear generators for many other multivariate distributions.

Random and Pseudo-Random Generators 

The concept of randomness is usually applied to a variable that is (to be) generated or
observed by a process involving some uncertainty, as in the definition presented by
Hammersley and Handscomb (1964, p. 10):

"A random event is an event which has a chance of happening, and probability
is a numerical measure of that chance."

Monte Carlo, and several other applications, require a random number generator.
With the last definition in mind, engineering devices based on sophisticated physical
processes have been built in the hope of offering a source of "true" random numbers.
However, these special devices were cumbersome, expensive, neither portable nor universally
available, and often unreliable. Moreover, practitioners soon realized that simple
deterministic sequences could successfully be used to emulate a random generator, as stated
in the following quotes (our emphasis) by Hammersley and Handscomb (1964, p. 26) and
Ripley (1987, p.15):

"For electronic digital computers it is most convenient to calculate a sequence
of numbers one at a time as required, by a completely specified rule
which is, however, so devised that no reasonable statistical test will detect
any significant departure from randomness. Such a sequence is called
pseudo-random. The great advantage of a specified rule is that the sequence can be
exactly reproduced for purposes of computational checking."

"A sequence of pseudorandom numbers ($U_i$) is a deterministic sequence of
numbers in [0, 1] having the same relevant statistical properties as a sequence
of random numbers."

Many deterministic random emulators used today are Linear Congruential Pseudo-Random
Generators (LCPRG), as in the following example:

$$ x_{i+1} = (a x_i + c) \bmod m , $$

where the multiplier a, the increment c and the modulus m should obey the conditions:
(i) c and m are relatively prime; (ii) a − 1 is divisible by all prime factors of m; (iii) a − 1
is a multiple of 4 if m is a multiple of 4. Under these conditions the generator attains its
maximal period, m. LCPRG's are fast and easy to implement if m
is taken as the computer's word range, $2^s$, where s is the computer's word size, typically
s = 32 or s = 64. The LCPRG's starting point, $x_0$, is called the seed. Given the same
seed the LCPRG will reproduce the same sequence, which is very important for tracing,
debugging and verifying application programs.
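
The three conditions and the role of the seed can be illustrated with a toy generator. The sketch below is in Python. The small modulus m = 16 and the parameter values are chosen by us only to make the full cycle easy to inspect, and the function names are hypothetical.

```python
from math import gcd


def prime_factors(m):
    """Set of prime factors of m, by trial division."""
    factors, p = set(), 2
    while p * p <= m:
        while m % p == 0:
            factors.add(p)
            m //= p
        p += 1
    if m > 1:
        factors.add(m)
    return factors


def obeys_conditions(a, c, m):
    """Conditions (i)-(iii) on the multiplier, increment and modulus."""
    return (gcd(c, m) == 1
            and all((a - 1) % p == 0 for p in prime_factors(m))
            and (m % 4 != 0 or (a - 1) % 4 == 0))


def lcprg(seed, a, c, m):
    """Generator of uniform variates in [0, 1[ from x_{i+1} = (a x_i + c) mod m."""
    x = seed % m
    while True:
        x = (a * x + c) % m
        yield x / m


def period(seed, a, c, m):
    """Number of steps until the state returns to the seed (None if it never does)."""
    x0 = seed % m
    x, steps = (a * x0 + c) % m, 1
    while x != x0 and steps <= m:
        x = (a * x + c) % m
        steps += 1
    return steps if x == x0 else None


good = (5, 3, 16)    # obeys (i)-(iii): visits all 16 states
bad = (3, 3, 16)     # a - 1 = 2 is not a multiple of 4
first = lcprg(42, *good)
again = lcprg(42, *good)
```

Two generators started from the same seed, such as `first` and `again`, produce identical sequences.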

However, LCPRG's are not a universal solution. For example, it is trivial to devise
some statistics that will be far from random, see Marsaglia (1968). Here the importance
of the words reasonable and relevant in the last quotations becomes clear: For
most Monte Carlo applications these statistics are irrelevant. LCPRG's can also exhibit
very long range auto-correlations and, unfortunately, these are more likely to affect long
simulated time series required in some special applications. The composition of several
LCPRG's by periodic seed refresh may mitigate some of these difficulties, see Pereira and
Stern (1999b). LCPRG's are also not appropriate to some special applications in
cryptography, see Boyar (1989). Current state of the art generators are given in Matsumoto
and Kurita (1992,1994) and Matsumoto and Nishimura (1998).

Chance is Lumpy - Quasi-Random Generators

"Chance is Lumpy" is Robert Abelson's First Law of Statistics, see Abelson (1995, p.xv).
The probabilistic expectation is a linear operator, that is, $E(Ax + b) = A E(x) + b$, where
x is a random vector and A and b are a determined matrix and vector. The covariance
operator is defined as $Cov(x) = E((x - E(x)) \otimes (x - E(x)))$. Hence, $Cov(Ax + b) =
A \, Cov(x) A'$. Therefore, given n i.i.d. scalar variables, $x_i \mid Var(x_i) = \sigma^2$, the variance of
their mean, $m = (1/n) \mathbf{1}' x$, is given by

$$ \frac{1}{n} \mathbf{1}' \, \mathrm{diag}(\sigma^2 \mathbf{1}) \, \frac{1}{n} \mathbf{1} = \frac{\sigma^2}{n^2} \, [1 \; 1 \; \dots \; 1] \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} = \frac{\sigma^2}{n} . $$

Hence, the mean's standard deviation is $std(m) = \sigma / \sqrt{n}$. So, mean values of i.i.d. random
variables converge to their expected values at a rate of $1/\sqrt{n}$.

Quasi-random sequences are deterministic sequences built not to emulate random
sequences, as pseudo-random sequences do, but to achieve faster convergence rates. For
d-dimensional quasi-random sequences, an appropriate measure of fluctuation, called
discrepancy, only grows at a rate of $\log(n)^d$, hence growing much slower than $\sqrt{n}$.
Therefore, the convergence rate corresponding to quasi-random sequences, $\log(n)^d / n$, is much
faster than the one corresponding to (pseudo) random sequences, $\sqrt{n}/n$. Figure 1
allows the visual comparison of typical (pseudo) random (left) and quasi-random (right)
sequences in $[0,1[^2$. By visual inspection we see that the points of the quasi-random
sequence are more "homogeneously scattered", that is, they do not "clump together", as the
points of the (pseudo) random sequence often do.

Let us consider an axis-parallel rectangle in the unit box,

$$ R = [a_1, b_1[ \times [a_2, b_2[ \times \dots \times [a_d, b_d[ \; \subseteq [0,1[^d . $$

The discrepancy of the sequence $s_{1:n}$ in box R, and the overall discrepancy of the sequence
are defined as

$$ D(s_{1:n}, R) = n \, Vol(R) - |s_{1:n} \cap R| , \qquad D(s_{1:n}) = \sup_{R \subseteq [0,1[^d} |D(s_{1:n}, R)| . $$

It is possible to prove that the discrepancy of the Halton-Hammersley sequence, defined
next, is of order $O(\log(n)^{d-1})$, see Matousek (1991, ch.2).

Halton-Hammersley sets: Given d − 1 distinct prime numbers, p(1), p(2), ..., p(d − 1),
the i-th point, $x^i$, in the Halton-Hammersley set, $\{x^1, x^2, \dots, x^{n-1}\}$, is

$$ x^i = \left[ i/n , \; r_{p(1)}(i) , \; r_{p(2)}(i) , \dots , r_{p(d-1)}(i) \right] , \quad \text{for } i = 1 : n-1 , \text{ where} $$

$$ i = a_0 + p(k) a_1 + p(k)^2 a_2 + p(k)^3 a_3 + \dots , \qquad r_{p(k)}(i) = \frac{a_0}{p(k)} + \frac{a_1}{p(k)^2} + \frac{a_2}{p(k)^3} + \dots $$

That is, the (k + 1)-th coordinate of $x^i$, $x^i_{k+1} = r_{p(k)}(i)$, is obtained by the bit reversal of
i written in p(k)-ary or base p(k) notation.
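
The digit reversal is easy to express directly. The following Python sketch is ours (the names `radical_inverse` and `hammersley` are not from the text). It computes $r_p(i)$ by peeling off the base-p digits of i, and assembles the points $x^i$ for i = 1 : n − 1 as defined above.

```python
def radical_inverse(i, p):
    """Digit reversal r_p(i): the base-p digits of i mirrored about the point."""
    r, weight = 0.0, 1.0 / p
    while i > 0:
        i, digit = divmod(i, p)    # digit a_0 first, then a_1, a_2, ...
        r += digit * weight        # a_k contributes a_k / p^(k+1)
        weight /= p
    return r


def hammersley(n, primes):
    """Points x^i = [i/n, r_p(1)(i), ..., r_p(d-1)(i)], for i = 1 : n-1."""
    return [[i / n] + [radical_inverse(i, p) for p in primes]
            for i in range(1, n)]


points = hammersley(8, [2])        # two-dimensional set, using the prime p = 2
```

For example, i = 6 is 110 in binary, so its reversal is 0.011 in binary, that is, 0.375.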

The Halton-Hammersley set is a generalization of the van der Corput set, built in the
bidimensional unit square, d = 2, using the first prime number, p = 2. The following
example, from Hammersley (1964, p. 33) and Günther and Jüngel (2003, p. 117), builds the
8-point van der Corput set, expressed in binary and decimal notation.

    function x = corput(n, b)
    % size n base b v. d. corput set
    m = floor(log(n)/log(b));      % index of the highest base-b digit of n
    u = 1:n; D = [];
    for i = 0:m
        d = rem(u, b);             % current least significant base-b digit
        u = (u - d)/b;             % shift out the digit just extracted
        D = [D; d];                % row i+1 of D holds digit a_i of each index
    end
    x = ((1./b').^(1:(m+1)))*D;    % digit reversal: sum of a_i / b^(i+1)

Figure G.1: (Pseudo)-random and quasi-random point sets on the unit box

Quasi-random sequences, also known as low-discrepancy sequences, can substitute
pseudo-random sequences in some applications of Monte Carlo methods, achieving higher
accuracy with less computational effort, see Merkel (2005), Okten (1999) and Sen, Samanta
and Reese (2006). Nevertheless, since by design the points of a quasi-random sequence
tend to avoid each other, strong (negative) correlations are expected to appear. In this
way, the very reason that can make quasi-random sequences so helpful can ultimately
impose some limits to their applicability. Some of these problems are commented by
Morokoff (1998, p.766):

"First, quasi-Monte Carlo methods are valid for integration problems, but may 
not be directly applicable to simulations, due to the correlations between the 
points of a quasi-random sequence. ... A second limitation: the improved 
accuracy of quasi-Monte Carlo methods is generally lost for problems of high 
dimension or problems in which the integrand is not smooth. " 

Subjective Randomness and its Paradoxes 

When asked to look at patterns like those in Figure 1, many subjects perceive the
quasi-random set as "more random" than the (pseudo) random set. How can this paradox be
explained? This was the topic of many psychological studies in the field of subjective
randomness. The quotation in the next paragraph is from one of these studies, namely,
Falk and Konold (1997, p. 306, emphasis is ours):

"One major source of confusion is the fact that randomness involves two
distinct ideas: process and pattern (Zabell, 1992). It is natural to think
of randomness as a process that generates unpredictable outcomes (stochastic
process according to Gell-Mann, 1994). Randomness of a process refers to
the unpredictability of the individual event in the series (Lopes, 1982). This
is what Spencer Brown (1957) calls primary randomness. However, one
usually determines the randomness of the process by means of its output, which
is supposed to be patternless. This kind of randomness refers, by definition,
to a sequence. It is labeled secondary randomness by Spencer Brown. It
requires that all symbol types, as well as all ordered pairs (digrams), ordered
triplets (trigrams)... n-grams in the sequence be equiprobable. This definition
could be valid for any n only in infinite sequences, and it may be approximated
in finite sequences only up to n's much smaller than the sequence's length. The
entropy measure of randomness (Attneave, 1959, chaps. 1 and 2) is based on
this definition.

These two aspects of randomness are closely related. We ordinarily expect
outcomes generated by a random process to be patternless. Most of them are.
Conversely, a sequence whose order is random supports the hypothesis that it
was generated by a random mechanism, whereas sequences whose order is not
random cast doubt on the random nature of the generating process."

Spencer-Brown was intrigued by the apparent incompatibility of the notions of primary
and secondary randomness. The apparent collision of these two notions generates several
interesting paradoxes, leading Spencer-Brown to question the applicability of the concept
of randomness to probability and statistical analysis, see Spencer-Brown (1953, 1957) and
Flew (1959), Good (1958) and Mundle (1959). See also Henning (2006), Kaptchuk and
Kerr (2004), Utts (1991), and Wassermann (1955). In fact, several subsequent psychological
studies were able to confirm that, for many subjects, the intuitive or common-sense
perceptions of primary and secondary randomness are quite discrepant. However, a careful
mathematical analysis makes it possible to reconcile the two notions of randomness.
These are the topics discussed in this section.

The relation between the joint and conditional entropy for a pair of random variables,
see appendix E.2,

$$ H(i,j) = H(j) + H(i \mid j) = H(i) + H(j \mid i) , $$

motivates the definition of first, second and higher order entropies, defined over the
distribution of words of size m in a string of letters from an alphabet of size a.



It is possible to use these entropy measures to assess the disorder or lack of pattern
in a given finite sequence, using the empirical probability distributions of single letters,
pairs, triplets, etc. However, in order to have a significant empirical distribution of
m-plets, any possible m-plet must be well represented in the sequence, that is, the word size,
m, is required to be very short relative to the sequence log-size, $m \ll \log_a(n)$.
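
As an illustration, the Python sketch below computes empirical word entropies for two bit strings. It rests on assumptions of ours that may differ from the definitions intended in the text: words are read with overlapping windows, entropies are in bits, and the m-th order entropy is taken as the conditional entropy given by the chain rule, H(m) − H(m − 1).

```python
from collections import Counter
from math import log2


def block_entropy(seq, m):
    """Entropy (in bits) of the empirical distribution of m-letter words."""
    words = [tuple(seq[i:i + m]) for i in range(len(seq) - m + 1)]
    n = len(words)
    return -sum(c / n * log2(c / n) for c in Counter(words).values())


def conditional_entropy(seq, m):
    """Entropy of the m-th letter given the previous m-1, by the chain rule."""
    if m == 1:
        return block_entropy(seq, 1)
    return block_entropy(seq, m) - block_entropy(seq, m - 1)


alternating = [0, 1] * 32        # over-alternating sequence 0101...
blocky = [0] * 32 + [1] * 32     # two long runs

h1 = [conditional_entropy(s, 1) for s in (alternating, blocky)]
h2 = [conditional_entropy(s, 2) for s in (alternating, blocky)]
```

Both strings have balanced letter frequencies, so their first order entropy is maximal (one bit), while their second order entropies are close to zero: each letter is almost perfectly predicted by the previous one.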

In the article of Falk and Konold (1997), Figure 2 displays the typical perceived or
apparent randomness of Boolean (0-1) bit sequences or black-and-white pixel grids versus
the second order entropy of the same strings and grids, see also Attneave (1959). Clearly,
there is a remarkable bias of the apparent randomness relative to the entropic measure.

"When people invent superfluous explanations because they perceive patterns
in random phenomena, they commit what is known in statistical parlance
as Type I error. The other way of going awry, known as Type II error, occurs
when one dismisses stimuli showing some regularity as random. The numerous
randomization studies in which participants generated too many alternations
and viewed this output as random, as well as the judgments of overalternating
sets as maximally random in the perception studies, were all instances of type
II error in research results." Falk and Konold (1997, p. 303).



This effect is also known as the gambler's fallacy when betting on cool spots, expecting
the random sequence to "compensate" finite average fluctuations from expected values.
Of course, some gamblers exhibit the opposite behavior, preferring to bet on hot spots,
expecting the same fluctuations to reoccur. These effects are the consequence of a perceived
coupling, by a negative or positive correlation or other measure of association, between
non-overlapping segments that are in fact supposed to be decoupled, uncorrelated or have
no association, that is, to be Markovian. For a statistical analysis, see Bonassi et al.
(2008). A possible psychological explanation of the gambler's fallacy is given by the
constructivist theory of Jean Piaget, see Piaget and Inhelder (1951), in which any "lump" in
the sequence is (mis)perceived as non-random order:

"In analogy to Piaget's operations, which are conceived as internalized ac- 
tions, perceived randomness might emerge from hypothetical action, that is, 
from a thought experiment in which one describes, predicts, or abbreviates the 
sequence. The harder the task in such a thought experiment, the more random 
the sequence is judged to be." Falk and Konold (1997, p. 316). 

The same hierarchical decomposition scheme used for higher order conditional entropy
measures can be adapted to measure the disorder or patternlessness of a sequence, relative to
a given subject's model of "computer" or generation mechanism. In the case of a discrete
string, this generation model could be, for example, a deterministic or probabilistic Turing
machine, a fixed or variable length Markov chain, etc. It is assumed that the model is
regulated by a code, program or vector parameter, θ, and outputs a data vector or observed
string, x. The hierarchical complexity measure of such a model emulates the Bayesian
prior and conditional likelihood decomposition, $H(p(\theta, x)) = H(p(\theta)) + H(p(x \mid \theta))$, that
is, the total complexity is given by the complexity of the program plus the complexity of
the output given the program. This is the starting point for several complexity models,
like Andrey Kolmogorov, Ray Solomonoff and Gregory Chaitin's computational complexity
models, Jorma Rissanen's Minimum Description Length (MDL), and Chris Wallace
and David Boulton's Minimum Message Length (MML). All these alternative complexity
models can also be used to successfully reconcile the notions of primary and secondary
randomness, showing that they are asymptotically equivalent, see Chaitin (1975, 1988),
Kac (1983), Kolmogorov (1965), Martin-Löf (1966, 1969).



G.2 Integration and Variance Reduction 

This section presents the derivation of generic Monte Carlo procedures for numerical
integration. We follow the presentation of Hammersley (1964). Let us consider the integration
of a bounded function, $0 \le f(x) \le 1$, in the unit interval, $x \in [0, 1]$. The crude Monte
Carlo unbiased estimate of this integral is the mean value of the function evaluated at
uniformly distributed i.i.d. random points, $x_i \in [0,1]$, i = 1 : n, with variance $\sigma_c^2$,

$$ \gamma = \int_0^1 f(x) \, dx , \qquad \hat\gamma_c = \frac{1}{n} \sum_{i=1}^{n} f(x_i) , \qquad \sigma_c^2 = \frac{1}{n} \int_0^1 (f(x) - \gamma)^2 \, dx . $$

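An alternative unbiased estimator is the hit-or-miss Monte Carlo, defined by the
auxiliary hit indicator function, h,

$$ h(x,y) = I(f(x) \ge y) , \qquad \gamma = \int_0^1 \!\! \int_0^1 h(x,y) \, dx \, dy , \qquad \hat\gamma_h = \frac{1}{n} \sum_{i=1}^{n} h(x_i, y_i) = \frac{n^*}{n} , $$

where $n^*$ counts the hits among the n sampled points $(x_i, y_i)$.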

The variance of this method is that of a Bernoulli variate. Simple manipulation shows
that

$$ \sigma_h^2 = \frac{\gamma (1 - \gamma)}{n} , \qquad \sigma_h^2 - \sigma_c^2 = \frac{1}{n} \int_0^1 f(x) (1 - f(x)) \, dx > 0 . $$

Hence, hit-or-miss MC is worse than crude MC, as one could guess from the fact that it
is using far less information about f at any given point, $x_i$.
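
The comparison can be reproduced numerically. The Python sketch below uses the integrand f(x) = x², our choice for illustration (any function bounded in [0, 1] would do), and estimates the variance of each estimator from 200 replications of n = 500 points. For this integrand the theoretical variances are (4/45)/n and (2/9)/n.

```python
import random
from statistics import mean, pvariance


def f(x):
    return x * x             # example integrand, 0 <= f(x) <= 1, integral 1/3


def crude_mc(n, rng):
    """Mean value of f at n uniform points."""
    return sum(f(rng.random()) for _ in range(n)) / n


def hit_or_miss_mc(n, rng):
    """Fraction of uniform points (x, y) falling under the graph of f."""
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        hits += f(x) >= y
    return hits / n


rng = random.Random(2024)    # fixed seed, so the run is reproducible
crude = [crude_mc(500, rng) for _ in range(200)]
hit = [hit_or_miss_mc(500, rng) for _ in range(200)]
var_crude, var_hit = pvariance(crude), pvariance(hit)
```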

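Another alternative is importance sampling MC, defined by an auxiliary sampling
distribution, g, in the integration interval,

$$ \gamma = \int f(x) \, dx = \int \frac{f(x)}{g(x)} \, g(x) \, dx = \int \frac{f(x)}{g(x)} \, dG(x) , $$

$$ \hat\gamma_s = \frac{1}{n} \sum_{i=1}^{n} \frac{f(x_i)}{g(x_i)} , \quad x_i \sim g , \; i = 1 : n ; \qquad \sigma_s^2 = \frac{1}{n} \int \left( \frac{f(x)}{g(x)} - \gamma \right)^2 dG(x) . $$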

The importance sampling method can be used on an arbitrary integration interval, as
long as we know how to draw the points $x_i$ according to the sampling distribution. The
variance of this method is minimized if $g \propto f$, that is, if the sampling distribution is
(approximately) proportional to the integrand. In order to achieve a small variance and
numerical stability, it is important to keep the sampling ratio bounded, $f/g \le c$. In
particular, if the integration interval is unbounded, the tails of the sampling distribution
should "cover" the tails of the integrand.
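
A small Python sketch may help to see the effect of the sampling distribution. The integrand f(x) = x² and the two sampling densities, g(x) = 2x and g(x) = 3x², are our own choices for illustration. Both are sampled by inversion of their cumulative distributions. With g exactly proportional to f every term f/g equals the integral, so the estimator has zero variance.

```python
import random
from math import sqrt


def f(x):
    return x * x                       # example integrand, integral 1/3


def importance_sampling(g, draw, n, rng):
    """Mean of the sampling ratio f/g at n points drawn from g."""
    return sum(f(x) / g(x) for x in (draw(rng) for _ in range(n))) / n


# Sampling densities on ]0, 1], each with its generator by inversion.
def g_linear(x):
    return 2.0 * x


def draw_linear(rng):
    return sqrt(1.0 - rng.random())    # G(x) = x^2, u taken in ]0, 1]


def g_square(x):
    return 3.0 * x * x                 # exactly proportional to f


def draw_square(rng):
    return (1.0 - rng.random()) ** (1.0 / 3.0)    # G(x) = x^3


rng = random.Random(99)
est_linear = importance_sampling(g_linear, draw_linear, 10000, rng)
est_square = importance_sampling(g_square, draw_square, 10000, rng)
```

For g(x) = 2x the sampling ratio is x/2, bounded by 1/2, and n times the variance is 1/72, against 4/45 for crude Monte Carlo.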

The formula for $\sigma_s^2$ suggests yet another strategy of variance reduction. Let $\varphi(x)$ be
a function that closely emulates or mimics f(x), but is easy to integrate analytically (or
even numerically). Such a $\varphi(x)$ is known as a control variate for f(x). The desired integral
can be computed as

$$ \gamma = \int \varphi(x) \, dx + \int (f(x) - \varphi(x)) \, dx = \gamma' + \int (f(x) - \varphi(x)) \, dx . $$

Consider the following estimators and variances,

$$ \hat\gamma = \frac{1}{n} \sum_{i=1}^{n} f(x_i) , \qquad \hat\gamma' = \frac{1}{n} \sum_{i=1}^{n} \varphi(x_i) , $$

$$ Var(\hat\gamma - \hat\gamma') = Var(\hat\gamma) + Var(\hat\gamma') - 2 \, Cov(\hat\gamma, \hat\gamma') . $$

That is, this method is useful if the integrand and the control variate are strongly
(positively) correlated.
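
A minimal Python sketch of this strategy follows. The integrand f(x) = x² and the control variate φ(x) = x, with known integral 1/2, are chosen by us for illustration. The same uniform points are used for f and φ, which is what makes the two estimators strongly correlated.

```python
import random
from statistics import pvariance


def f(x):
    return x * x         # example integrand, integral 1/3


def phi(x):
    return x             # control variate, easy to integrate


GAMMA_PHI = 0.5          # known integral of phi on [0, 1]

rng = random.Random(7)
xs = [rng.random() for _ in range(10000)]

plain = sum(f(x) for x in xs) / len(xs)
controlled = GAMMA_PHI + sum(f(x) - phi(x) for x in xs) / len(xs)

# Per-point variances: the controlled estimator averages f - phi instead of f.
var_plain = pvariance([f(x) for x in xs])
var_controlled = pvariance([f(x) - phi(x) for x in xs])
```

For this pair the per-point variance drops from 4/45 to 1/180.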



Non-Uniform Random Generators 

This section considers some elementary methods for producing i.i.d. non-uniform variates,
$x_i$, from a source of uniform variates in the unit interval, $u_i \sim U_{]0,1[}$. Perhaps the simplest
example is to produce a Bernoulli variate:

(a) If $0 < u_i \le p$, then $x_i = 1$, else ($p < u_i < 1$), $x_i = 0$.

If F(x) is the cumulative distribution of f(x), and $x_i \sim f$, then $u_i = F(x_i) \sim U_{[0,1]}$.
Hence, if F(x) is invertible, we can just take $x_i = F^{-1}(u_i)$ as a mechanism for generating
f distributed variates. For example:

(b) The exponential distribution with mean 1/λ is given by $f(t) = \lambda \exp(-\lambda t)$, and
$F(t) = 1 - \exp(-\lambda t)$. Hence, since 1 − u is also uniform, $t = (-1/\lambda) \ln(u)$ produces an
exponential variate.

(c) The Cauchy distribution with location and scale parameters, a, b, is given by
$1/f(x) = \pi b (1 + ((x - a)/b)^2)$, $F(x) = (1/2) + (1/\pi) \arctan((x - a)/b)$. Hence,
$x = a + b \tan(\pi (u - (1/2)))$ produces the corresponding Cauchy variate.

The characterization of a distribution in terms of a second distribution may offer an
implicit generation mechanism. For example:

(d) The Chi-squared distribution with 2 degrees of freedom, $\chi^2_2$, is a particular case of
the exponential with mean (1/λ) = 2. Hence, $x = -2 \ln(u)$ generates a $\chi^2_2$ variate.

(e) A $\chi^2_d$ variate is characterized as a sum of squares of d standard normal variates.
Hence, if d is even, we can generate a $\chi^2_d$ variate as $x = -2 \ln(u_1 u_2 \dots u_{d/2})$.

(f) Counting consecutive λ-exponential arrivals until the threshold $t_1 + t_2 + \dots + t_k > 1$
produces a rate-λ Poisson variate, $f(k) = \exp(-\lambda) \lambda^k / k!$.

(g) Appendix B presents characterizations of many discrete distributions by the Poisson,
hence providing implicit generation mechanisms for those distributions.

(h) The following two dimensional transformation method generates two i.i.d. standard
Normal variates, see Ripley (1987).

$$ u, v \sim U_{[0,1]} , \quad \theta = 2 \pi u , \quad r = \sqrt{-2 \ln(v)} , \quad x = r \cos(\theta) , \quad y = r \sin(\theta) . $$

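To check the method consider the transformation to polar coordinates, [r, θ], of a standard
bivariate normal, $[x,y] \sim (1/2\pi) \exp((-1/2)(x^2 + y^2))$,

$$ f(r, \theta) = \frac{1}{2\pi} \exp\left( \frac{-r^2}{2} \right) \left| \det \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ -r \sin(\theta) & r \cos(\theta) \end{bmatrix} \right| = \frac{1}{2\pi} \, r \exp\left( \frac{-r^2}{2} \right) . $$

Hence, r and θ are independent, θ is uniformly distributed in [0, 2π], and $r^2$ is a $\chi^2_2$
variate. Finally, we see that $r^2$ is produced by the transformation defined in item (d)
above to generate a $\chi^2_2$ variate.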
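
Several of the generators in items (b) to (h) are one-liners. The Python sketch below collects them. The function names are ours, and each function takes its uniform variates as arguments (assumed to lie in ]0, 1], so that the logarithm is finite), which makes the transformations easy to check by hand.

```python
import random
from math import cos, log, pi, sin, sqrt, tan


def exponential(u, lam):
    """Item (b): exponential variate with mean 1/lam, by inversion."""
    return -log(u) / lam


def cauchy(u, a, b):
    """Item (c): Cauchy variate with location a and scale b, by inversion."""
    return a + b * tan(pi * (u - 0.5))


def chi2_even(us):
    """Item (e): chi-squared variate with d = 2 * len(us) degrees of freedom."""
    return -2.0 * sum(log(u) for u in us)


def poisson(lam, rng):
    """Item (f): number of lam-exponential arrivals in the unit interval."""
    k, t = 0, exponential(1.0 - rng.random(), lam)
    while t <= 1.0:
        k += 1
        t += exponential(1.0 - rng.random(), lam)
    return k


def box_muller(u, v):
    """Item (h): two independent standard Normal variates."""
    theta, r = 2.0 * pi * u, sqrt(-2.0 * log(v))
    return r * cos(theta), r * sin(theta)


rng = random.Random(3)
counts = [poisson(3.0, rng) for _ in range(20000)]
x, y = box_muller(0.25, 0.5)
```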

If the scaled density, κg, can be used as an envelope dominating the density f, that
is, $f \le \kappa g$, the following acceptance-rejection method due to von Neumann can be used:
(1) Generate $[y_i, u_i] \sim g \times U_{[0,1]}$ until $\kappa u_i \le f(y_i)/g(y_i)$. (2) Take $x_i = y_i$.
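
The two steps can be sketched in Python as follows. The target density f(x) = 6x(1 − x) on [0, 1], the uniform envelope density g, and the scale factor κ = 1.5 (the maximum of f/g) are our own choices for illustration. The expected acceptance rate of the method is 1/κ.

```python
import random
from statistics import mean


def f(x):
    return 6.0 * x * (1.0 - x)     # target density on [0, 1]


def g(x):
    return 1.0                     # envelope density: uniform on [0, 1]


KAPPA = 1.5                        # max of f/g, attained at x = 1/2


def accept_reject(rng):
    """Return one variate distributed as f, and the number of trials used."""
    trials = 0
    while True:
        trials += 1
        y, u = rng.random(), rng.random()    # step (1): [y, u] ~ g x U[0,1]
        if KAPPA * u <= f(y) / g(y):
            return y, trials                 # step (2): take x = y


rng = random.Random(11)
draws = [accept_reject(rng) for _ in range(20000)]
sample = [y for y, _ in draws]
acceptance_rate = len(draws) / sum(t for _, t in draws)
```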

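The Gamma distribution with parameter c is $f(x) = x^{c-1} \exp(-x) / \Gamma(c)$. For c = 1
this is the exponential distribution. Also, the sum of two independent gamma variates with
parameters $c_1, c_2$ is a gamma variate with parameter $c_1 + c_2$. The following results, given in
Deak (1990, sec. 4.5), provide implicit acceptance-rejection generation methods:

(i) For c < 1, f(x) is dominated by the following density g(x) scaled by the factor
$\kappa = (c \Gamma(c))^{-1} + (e \Gamma(c))^{-1}$. Moreover, G(x) has an easy analytic form.

$$
g(x) = \begin{cases} \dfrac{1}{\kappa \Gamma(c)} \, x^{c-1} , & \text{if } x \in [0, 1] \\[2mm] \dfrac{1}{\kappa \Gamma(c)} \, e^{-x} , & \text{if } x \in [1, \infty[ \end{cases}
\qquad
G(x) = \begin{cases} \dfrac{1}{\kappa c \Gamma(c)} \, x^{c} , & \text{if } x \in [0, 1] \\[2mm] \dfrac{1}{\kappa c \Gamma(c)} + \dfrac{1}{\kappa e \Gamma(c)} \left( 1 - e^{1-x} \right) , & \text{if } x \in [1, \infty[ \end{cases}
$$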



(ii) For c > 1, f(x) is dominated by a Cauchy with parameters $a = 1/\sqrt{2c - 1}$ and
$b = c - 1$, scaled by the factor $\kappa = \pi \sqrt{2c - 1} \exp(-c + 1) (c - 1)^{c-1} / \Gamma(c)$.

(iii) For c > 1, f(x) is dominated by the envelope density, $g_c(x)$, and scale factor, $\kappa_c$,
described as follows. First, let us consider an auxiliary variate distributed as the t-density
with 2 degrees of freedom. The auxiliary density, g(y), cumulative distribution, G(y), and
generation method by direct inversion are as follows:

$$ g(y) = \frac{1}{2 \sqrt{2}} \left( 1 + \frac{y^2}{2} \right)^{-3/2} , \qquad G(y) = \frac{1}{2} \left( 1 + \frac{y}{\sqrt{2 + y^2}} \right) , \qquad y = \frac{2u - 1}{\sqrt{2 u (1 - u)}} . $$

Next, let us consider the envelope variate with density and scale factor defined as

$$ \kappa_c \, g_c(x) = \frac{(c-1)^{c-1} \exp(-(c-1))}{\Gamma(c)} \left( 1 + \frac{1}{2} \left( \frac{x - (c-1)}{\sqrt{3c/2 - 3/8}} \right)^2 \right)^{-3/2} . $$

The envelope variate can be generated from the auxiliary variate as

$$ x = (c - 1) + y \sqrt{3c/2 - 3/8} . $$

(iv) It is easy to check that if y is a gamma variate with parameter c + 1 and u is
uniform in [0, 1[, then $x = y u^{1/c}$ is a gamma variate with parameter c. This property
allows a gamma generator for the domain c < 1 to be used to generate a gamma variate with
parameter c > 1, and vice versa.

Appendix B presents characterizations of the Beta and Dirichlet distributions by the
Gamma, hence providing implicit generation mechanisms for those distributions. For
more non-uniform random generation methods see Deak (1990), Gentle (1998), Lange
(2000), Ripley (1987), and the encyclopedic work of Fishman (1996).



G.3 MCMC - Markov Chain Monte Carlo 

This section uses the matrix notation and the basic facts about homogeneous Markov
chains reviewed in section H.1.

Markov Chain Monte Carlo, Conditional Monte Carlo, etc. are common names for
methods that generate indirect random sampling for a discrete target density g. MCMC
sampling is based on a Markov chain that has the target density as its limit distribution.
Our presentation follows ch.1 of Gilks et al. (1996). For the original papers, see Geman
and Geman (1984), Hastings (1970), Metropolis and Ulam (1949), and Metropolis et al.
(1953).

The basic idea of the MCMC algorithms is to adapt a general (irreducible and aperiodic)
sampling kernel, Q, to the desired target distribution, g > 0. Starting from state i,
the MCMC algorithm proceeds as follows:

(1) A candidate for the next state, j, is proposed with probability $Q_i^j$.
(2a) The chain moves to the candidate j with acceptance probability a{i,j). 
(2b) Otherwise, candidate j is rejected, and the chain remains at state i. 
(3) Go to step 1. 

Formally, the MCMC transition kernel, P, has the form

$$ P_i^j = Q_i^j \alpha(i,j) + I(j = i) \left( 1 - \sum_k Q_i^k \alpha(i,k) \right) , $$

where the first term corresponds to the acceptance of new state j, while the second term
corresponds to the rejection of the proposed candidate, indicating that the chain remains
at state i.

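In order to obtain the target distribution, g, as the limit distribution of the MCMC,
we want to choose an acceptance probability, α(i,j), that enforces the detailed balance
equation,

$$ g^i P_i^j = g^i Q_i^j \alpha(i,j) = g^j Q_j^i \alpha(j,i) = g^j P_j^i . $$

It is easy to check that the acceptance probabilities suggested by Metropolis-Hastings and
Barker accomplish the goal. They are, respectively,

$$ \alpha(i,j) = \min\left( 1 , \frac{g^j Q_j^i}{g^i Q_i^j} \right) , \qquad \alpha(i,j) = \frac{g^j Q_j^i}{g^i Q_i^j + g^j Q_j^i} . $$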
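
For a small discrete state space the whole construction can be verified numerically. The Python sketch below is ours: it assembles the transition kernel P from an un-normalized target g and a sampling kernel Q, using the standard forms of the Metropolis-Hastings and Barker acceptance probabilities, and measures how far P is from detailed balance and from having g as an invariant distribution. The target `G` and the kernel `Q` are arbitrary made-up examples.

```python
def mcmc_kernel(g, Q, rule="metropolis-hastings"):
    """Transition kernel P built from target g and sampling kernel Q."""
    n = len(g)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j == i or Q[i][j] == 0.0:
                continue
            forward, backward = g[i] * Q[i][j], g[j] * Q[j][i]
            if rule == "barker":
                alpha = backward / (forward + backward)
            else:
                alpha = min(1.0, backward / forward)
            P[i][j] = Q[i][j] * alpha
        P[i][i] = 1.0 - sum(P[i])     # rejected candidates: chain stays at i
    return P


def balance_gap(g, P):
    """Largest violation of g^i P_i^j = g^j P_j^i."""
    n = len(g)
    return max(abs(g[i] * P[i][j] - g[j] * P[j][i])
               for i in range(n) for j in range(n))


def invariance_gap(g, P):
    """Largest violation of g P = g (g need not be normalized)."""
    n = len(g)
    return max(abs(sum(g[i] * P[i][j] for i in range(n)) - g[j])
               for j in range(n))


G = [1.0, 2.0, 3.0, 4.0]              # un-normalized target (example)
Q = [[0.10, 0.30, 0.30, 0.30],        # non-symmetric sampling kernel (example)
     [0.25, 0.25, 0.25, 0.25],
     [0.40, 0.20, 0.20, 0.20],
     [0.10, 0.20, 0.30, 0.40]]

P_MH = mcmc_kernel(G, Q)
P_BARKER = mcmc_kernel(G, Q, rule="barker")
```

Notice that only ratios of the entries of `G` enter the computation, so the target never needs to be normalized.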



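In Bayesian statistics, MCMC methods are typically used to compute I, the expected
value of a function, f(θ), on a specific region of the parameter space, $T \subseteq \Theta$, with respect
to the posterior density, $p_n(\theta)$. In standard Bayesian models, $p_n(\theta) = c(y)^{-1} L(\theta \mid y) p_0(\theta)$,
where $p_0(\theta)$ is the prior distribution of the parameter θ, $L(\theta \mid y)$ is the likelihood of θ given
the observed data y, and c(y) is the posterior normalization constant. Hence,

$$ I = \frac{1}{c(y)} \int_T f(\theta) \, g(\theta) \, d\theta , \qquad g(\theta) = L(\theta \mid y) \, p_0(\theta) , \qquad c(y) = \int_\Theta g(\theta) \, d\theta . $$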



Notice that α(i,j), the acceptance probabilities defined above, can be computed from
posterior ratios, $p_n(\theta^j) / p_n(\theta^i) = g^j / g^i$. Hence, actual implementations of these MCMC
algorithms do not require the explicit knowledge of the target distribution normalization
constant, c(y). It suffices to have an un-normalized function that is proportional to the
target distribution, $g(\theta) \propto p_n(\theta)$, as is the case for the likelihood-prior product.

The original Metropolis algorithm uses a symmetric sampling kernel, $Q_i^j = Q_j^i$, see
Metropolis et al. (1953). In this case, the Metropolis-Hastings acceptance probability can
be simplified to the form $\alpha(i,j) = \min(1, g^j / g^i)$. In statistical physics, the density of
interest, $g^i$, often takes the form of the Boltzmann distribution, $g^i = \exp(-\beta H(i))$, where
the Hamiltonian function, H(i), gives the energy of the corresponding state. In this case,
a new state of lower energy, $j \mid \Delta H = H(j) - H(i) < 0$, is accepted for sure, while a
state of higher energy is accepted with probability $\exp(-\beta \Delta H)$. In section H.1, the same
acceptance rejection mechanism reappears in the Metropolis version of Simulated Annealing.

Random Walk Metropolis algorithms use a symmetric kernel that is a function only of 
the random walk step, z = y — x, that is, Q{x, y) = Q{z) = Q{—z). A common option in 
practical implementations is to chose the random walk step from a multivariate Normal 
distribution, z ~ A^(0,S). The covariance matrix, S, scales the random walk steps. If 
the steps are too large, the proposed steps would often result in sharp decrease of the 
traget density, so the acceptance rate is low, making the MCMC inefficient. If the steps 



9 



ff = g'Qlait^j) = g'Q)aij,i) = g^P] . 





fi9)g{0\y)d0 , g{e)^L{0\y)po{0) , c{y) 
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are too small, the acceptance rate may be high, but too many steps are required to move 
effectively across the integration region and, again, the MCMC is inefficient. 

A practical solution is to take the covariance matrix, $\Sigma$, proportional to the inverse
Hessian matrix, $( - \partial^2 \log g(x) / \partial x' \partial x )^{-1}$, computed at the estimated mode,
$\hat{x} = \arg\max g(x)$. Alternatively, one can take $\Sigma$ proportional to a convex combination of
the diagonal matrix $D$, a prior estimate of marginal variances, and the current estimate of the
sampled covariance matrix.



In both cases, the proportionality constant is interactively adapted in order to obtain
an acceptance rate in a specified range. If the target distribution has heavy tails, this
sampling kernel may be modified to a multivariate Student's t-distribution. For further
details, see Gilks et al. (1996).
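The random walk Metropolis scheme described above can be sketched in a few lines. This is only a minimal one-dimensional illustration, not the implementation used for the FBST: the function names, the un-normalized standard Normal target and the step size 2.4 are assumptions chosen for the example.

```python
import math
import random


def random_walk_metropolis(log_g, x0, step, n_samples, rng):
    """Sample from a density proportional to exp(log_g(x)) on the real line."""
    x, log_gx = x0, log_g(x0)
    samples, accepted = [], 0
    for _ in range(n_samples):
        y = x + rng.gauss(0.0, step)      # symmetric proposal, Q(z) = Q(-z)
        log_gy = log_g(y)
        # Metropolis rule: accept with probability min(1, g(y) / g(x)).
        if math.log(1.0 - rng.random()) < log_gy - log_gx:
            x, log_gx = y, log_gy
            accepted += 1
        samples.append(x)                 # a rejected move repeats the state
    return samples, accepted / n_samples


def log_target(x):
    # Un-normalized standard Normal: the normalization constant is not needed.
    return -0.5 * x * x


rng = random.Random(0)
samples, rate = random_walk_metropolis(log_target, 0.0, 2.4, 20000, rng)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Only ratios of the target enter the acceptance test, so working with log densities avoids overflow. Changing `step` shows the trade-off discussed above: very large steps lower the acceptance rate, very small steps raise it but slow the exploration.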

Cyclic MCMC schemes use a "composite kernel" that updates, one by one, the individual
components of a $k$-dimensional vector state, $x$. That is, a cyclic MCMC goes from
the current state, $x$, to the next state, $y$, by $k$ intermediate steps,
$x = [x_1, x_2, \ldots x_k]$, $[y_1, x_2, \ldots x_k]$, $[y_1, y_2, \ldots x_k]$, $\ldots$
$[y_1, y_2, \ldots y_k] = y$. Cyclic schemes include the Gibbs sampler, popularized by
Geman and Geman (1984), and many useful variations.

G.4 Estimation of Ratios 

This section presents the derivation of the Monte Carlo procedure for the numerical
integrations required to implement the FBST. The symbol $X$ represents the observed
data or some sufficient statistics. The best approach to the numerical integration step
required by the FBST is approximation by Monte Carlo (MC) simulation, see Appendix
A for the FBST definition, and Evans and Swartz (2000) and Zacks and Stern (2003) for
the Monte Carlo approach to this integration problem. We want an estimate of the ratio



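$$ \overline{ev}(H; X) = \frac{\int_{\overline{T}} f(\theta; X) d\theta}{\int_{\Theta} f(\theta; X) d\theta} , \quad
   \overline{T} = \overline{T}(s^*) , \quad
   \overline{T}(v) = \{ \theta \in \Theta \mid s(\theta) > v \} . $$

Since the space $\Theta$ is unbounded, we randomly choose the values of $\theta$ according to an
"importance sampling" density $g(\theta)$, which is positive on $\Theta$. The evidence function is
equivalent to

$$ \overline{ev}(H; X) = \frac{\int_{\Theta} Z_g^*(\theta; X) g(\theta) d\theta}{\int_{\Theta} Z_g(\theta; X) g(\theta) d\theta} , $$

where

$$ Z_g(\theta; X) = \frac{f(\theta; X)}{g(\theta)} , \quad
   Z_g^*(\theta; X) = 1^*(\theta; X) Z_g(\theta; X) \quad \mbox{and} \quad
   1^*(\theta; X) = 1(\theta \in \overline{T}) . $$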

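Thus, a Monte Carlo estimate of the evidence is

$$ Ev_{g,m}(X) = \frac{\sum_{j=1}^m Z_g^*(\theta^j; X)}{\sum_{j=1}^m Z_g(\theta^j; X)} , $$

where $\theta^j$, $j = 1 \ldots m$, are iid and independently chosen in $\Theta$ according to the
importance sampling density $g(\theta)$. Thus,

$$ Ev_{g,m}(X) \rightarrow \overline{ev}(H; X) \quad \mbox{a.s.} [g] , \quad \mbox{as} \quad m \rightarrow \infty . $$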
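The ratio estimator above can be illustrated with a small sketch. Everything specific in it is an assumption made for the example: the un-normalized density $f$ is a standard Normal kernel, the integration region is the set where $f$ exceeds its value at $\theta = 1$, and the importance density $g$ is a Normal with standard deviation 2.

```python
import math
import random


def evidence_ratio(f, in_region, sample_g, density_g, m, rng):
    """Ratio of sums of importance weights, restricted / unrestricted."""
    num = den = 0.0
    for _ in range(m):
        theta = sample_g(rng)
        z = f(theta) / density_g(theta)   # weight Z_g(theta; X)
        den += z
        if in_region(theta):              # indicator of the integration region
            num += z                      # weight Z*_g(theta; X)
    return num / den


def f(theta):
    return math.exp(-0.5 * theta * theta)          # un-normalized density


f_star = f(1.0)                                    # hypothetical threshold


def in_region(theta):
    return f(theta) > f_star                       # here: |theta| < 1


def sample_g(rng):
    return rng.gauss(0.0, 2.0)


def density_g(theta):
    return math.exp(-theta * theta / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))


rng = random.Random(1)
estimate = evidence_ratio(f, in_region, sample_g, density_g, 50000, rng)
exact = math.erf(1.0 / math.sqrt(2.0))             # P(|N(0,1)| < 1)
```

Because both sums use the same weights, any constant factor in $f$ or $g$ cancels in the ratio, which is why an un-normalized $f$ suffices.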

The goodness of the MC estimation depends on the choice of g and m. Standard statis- 
tical software libraries have univariate random generators for most common distributions. 
These univariate generators can also be used to build vector variates from multivariate 
distributions. Appendix D describes how to generate a Dirichlet vector variate from 
univariate Gammas. 

Johnson (1980) describes a simple procedure to generate the Cholesky factor of a
Wishart variate $W = U'U$ with $n$ degrees of freedom, from the Cholesky factorization of
the covariance parameter $V = C'C$:

$$ L_i^j = N(0,1) , \quad i > j ; \quad
   L_i^i = + \sqrt{\chi^2(n - i + 1)} ; \quad U = L'C . $$

At the integration step it is important to perform all matrix computations directly from
Cholesky factors, Golub and van Loan (1989), Jones (1985). In this problem we can
therefore use "exact sampling", which substantially simplifies the integration step, i.e.,
$Z_g(\theta; X) = 1$.
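The Cholesky-factor construction can be sketched as follows. This is an illustration, not Johnson's code: it assumes the usual Bartlett form, with chi-square diagonal entries of $n - i + 1$ degrees of freedom, and draws chi-square variates as Gamma variates. The helper names and the $2 \times 2$ example matrix are hypothetical.

```python
import math
import random


def cholesky_upper(V):
    """Upper triangular C with V = C'C, for a symmetric positive definite V."""
    d = len(V)
    C = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):
            s = V[i][j] - sum(C[k][i] * C[k][j] for k in range(i))
            C[i][j] = math.sqrt(s) if i == j else s / C[i][i]
    return C


def wishart_factor(C, n, rng):
    """Upper triangular U = L'C, so that W = U'U is a Wishart(n, C'C) variate."""
    d = len(C)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i):
            L[i][j] = rng.gauss(0.0, 1.0)
        # chi-square with n - i degrees of freedom (0-based i), as a Gamma
        L[i][i] = math.sqrt(rng.gammavariate((n - i) / 2.0, 2.0))
    return [[sum(L[k][i] * C[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]


V = [[2.0, 0.5], [0.5, 1.0]]
n = 5
C = cholesky_upper(V)
rng = random.Random(2)
draws = 4000
mean_W = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(draws):
    U = wishart_factor(C, n, rng)
    for i in range(2):
        for j in range(2):
            mean_W[i][j] += sum(U[k][i] * U[k][j] for k in range(2)) / draws
```

The sample mean of $W$ should be close to $nV$, which gives a quick sanity check of the construction.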



Precision of the MC Simulation 

In order to control the number of points, $m$, used at each MC simulation, we need an
estimate of MC precision for evidence estimation. For a fixed large value $m$, the asymptotic
distribution of $Ev_{g,m}(X)$ is normal with mean $\overline{ev}(H; X)$ and asymptotic variance $V_g(X)$.
According to the delta method, Bickel and Doksum (2001), we obtain that

where

$$ \mu_g = \int_{\Theta} Z_g(\theta; X) g(\theta) d\theta , \quad
   \mu_g^* = \int_{\Theta} Z_g^*(\theta; X) g(\theta) d\theta , $$

$$ \sigma_g^2 = \int_{\Theta} ( Z_g(\theta; X) - \mu_g )^2 g(\theta) d\theta , \quad
   \sigma_g^{*2} = \int_{\Theta} ( Z_g^*(\theta; X) - \mu_g^* )^2 g(\theta) d\theta , $$

$$ \gamma_g = \int_{\Theta} ( Z_g(\theta; X) - \mu_g ) ( Z_g^*(\theta; X) - \mu_g^* ) g(\theta) d\theta $$

are the expected values, variances and covariance of $Z_g(\theta; X)$ and $Z_g^*(\theta; X)$ with
respect to $g(\theta)$.

Define the coefficients

$$ \xi_g = \frac{\sigma_g}{\mu_g} , \quad \xi_g^* = \frac{\sigma_g^*}{\mu_g} . $$

For abbreviation, let $\eta = \overline{ev}(H; X)$. Also note that $\eta = \mu_g^* / \mu_g$. Then the asymptotic
variance is



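Let us define the complementary variables

$$ Z_g^c(\theta; X) = 1^c(\theta; X) Z_g(\theta; X) , \quad
   1^c(\theta; X) = 1 - 1^*(\theta; X) , \quad
   \xi_g^c = \frac{\sigma_g^c}{\mu_g} . $$

Some algebraic manipulation gives us $V_g(X)$ in terms of $\xi_g^*$ and $\xi_g^c$, namely

$$ V_g(X) = \frac{1}{m} \Big( \xi_g^{*2} (1 - \eta)^2 + \xi_g^{c2} \eta^2 + 2 \eta^2 (1 - \eta)^2 \Big) . $$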

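For large values of $m$, the asymptotic $(1 - \beta)$ level confidence interval
for $\eta$ is $\hat{\eta} \pm \Delta_{g,m,\beta}$, where

$$ \Delta_{g,m,\beta}^2 = \frac{1}{m} F_{1-\beta}(1,m)
   \Big( \hat{\xi}_g^{*2} (1 - \hat{\eta})^2 + \hat{\xi}_g^{c2} \hat{\eta}^2 + 2 \hat{\eta}^2 (1 - \hat{\eta})^2 \Big) , $$

where $F_{1-\beta}(1,m)$ is the $1 - \beta$ quantile of the $F(1,m)$ distribution, and $\hat{\eta}$,
$\hat{\xi}_g^*$ and $\hat{\xi}_g^c$ are consistent estimators of the respective quantities.

For large $m$, we can also use the approximation,

$$ \Delta_{g,m,\beta}^2 = \frac{1}{m} \chi_{1-\beta}^2(1)
   \Big( \hat{\xi}_g^{*2} (1 - \hat{\eta})^2 + \hat{\xi}_g^{c2} \hat{\eta}^2 + 2 \hat{\eta}^2 (1 - \hat{\eta})^2 \Big) , $$

since $F(1,m)$ converges in distribution to the chi-square distribution with 1 degree of
freedom, as $m \rightarrow \infty$.

If we wish to have $\Delta_{g,m,\beta} \leq \delta$ for a prescribed value of $\delta$, then $m$ should be such that

$$ m \geq \frac{\chi_{1-\beta}^2(1)}{\delta^2}
   \Big( \hat{\xi}_g^{*2} (1 - \hat{\eta})^2 + \hat{\xi}_g^{c2} \hat{\eta}^2 + 2 \hat{\eta}^2 (1 - \hat{\eta})^2 \Big) . $$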
G.5 Monte Carlo for Linear Systems

We want to solve the simultaneous matrix equation,

$$ x = Hx + b , \quad H \ n \times n . $$

The (Direct) Monte Carlo methods of von Neumann and Ulam (NU) and of Wasow
(WS) are based on probability transitions, $P_i^j$, and multipliers or weights, $V_i^j$, satisfying
the following conditions:

$$ V_i^j = ( H_i^j / P_i^j ) \, I(P_i^j > 0) \ \mid \ H_i^j \neq 0 \Rightarrow P_i^j > 0 \ \wedge \ P_i \mathbf{1} < 1 . $$

We also define the extended Stochastic matrix,

$$ \overline{P} = \begin{bmatrix}
   P_1^1 & \ldots & P_1^n & P_1^{n+1} \\
   \vdots & \ddots & \vdots & \vdots \\
   P_n^1 & \ldots & P_n^n & P_n^{n+1} \\
   0 & \ldots & 0 & 1 \end{bmatrix} . $$

$\overline{P}$ defines a Markov chain in a space of $n + 1$ states, $\{1, 2, \ldots n, n+1\}$, where the last
state, $n + 1$, is an absorbing state.

We want to consider a random path or trajectory, $T$, of a "particle" starting at state
$i$, until the particle is absorbed at step $m + 1$, that is,

$$ T = [ T(1) = i, T(2), \ldots , T(m), T(m+1) = n+1 ] . $$

We define a random variable, $X(T)$, associated to each trajectory.
First we define the multiplier products

$$ v_1 = 1 \quad \mbox{and} \quad v_k = v_{k-1} V_{T(k-1)}^{T(k)} , \quad 2 \leq k \leq m . $$

Von Neumann-Ulam's and Wasow's versions of the Monte Carlo algorithm use $X(T)$
equal to, respectively,

$$ NU(T) = v_m b_{T(m)} / P_{T(m)}^{n+1} \quad \mbox{and} \quad WS(T) = \sum_{k=1}^m v_k b_{T(k)} . $$

The key to these Monte Carlo algorithms is that the expected value of the variable
$X(T)$, over all trajectories starting at state $i$, is the solution of the simultaneous equation,
provided these expected values are well defined, that is,

$$ \mbox{if} \quad e_i = E( X(T) \mid T(1) = i ) \quad \mbox{then} \quad e = He + b . $$

Let us prove the statement above for Wasow's version. By definition,

$$ e_i = E \Big( \sum_{k=1}^m v_k b_{T(k)} \mid T(1) = i \Big) = \sum_T X(T) Pr(T) , $$

$$ T = [ T(1) = i, T(2), \ldots T(m+1) = n+1 ] , \quad m = 1, 2, \ldots \infty . $$

Given a trajectory $T$ with $T(2) = j \leq n$, we can separate the terms in $X(T)$ with index 1, that is,

$$ X(T) = b_i + V_i^j X( T(2 : m+1) ) , $$

while $X(T) = b_i$ if the particle is absorbed at the first step. Hence,

$$ e_i = b_i P_i^{n+1} + \sum_{j=1}^n P_i^j \sum_{S=[j, \ldots n+1]} ( b_i + V_i^j X(S) ) Pr(S) $$

$$ = \sum_{j=1}^n P_i^j V_i^j \sum_{S=[j, \ldots n+1]} X(S) Pr(S)
   + b_i \Big( P_i^{n+1} + \sum_{j=1}^n P_i^j \sum_{S=[j, \ldots n+1]} Pr(S) \Big) $$

$$ = \sum_{j=1}^n H_i^j e_j + b_i , \quad \mbox{Q.E.D.} $$
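Wasow's direct estimator can be sketched as below. It is an illustration under stated assumptions: the $2 \times 2$ matrix `H`, the vector `b` and the transition probabilities `P` (each row summing to 0.5, so absorption has probability 0.5 at every step) are hypothetical, and the weights are taken as $H_i^j / P_i^j$.

```python
import random


def wasow_estimate(H, b, P, i, n_paths, rng):
    """Average of WS(T) = sum_k v_k b_T(k) over random absorbed trajectories."""
    n = len(b)
    total = 0.0
    for _ in range(n_paths):
        state, v, x = i, 1.0, 0.0
        while state is not None:
            x += v * b[state]
            # Draw the next state; falling through all j means absorption.
            u, nxt, acc = rng.random(), None, 0.0
            for j in range(n):
                acc += P[state][j]
                if u < acc:
                    nxt = j
                    break
            if nxt is not None:
                v *= H[state][nxt] / P[state][nxt]   # multiplier V_i^j
            state = nxt
        total += x
    return total / n_paths


H = [[0.1, 0.3], [0.2, 0.2]]
b = [1.0, 2.0]
P = [[0.25, 0.25], [0.25, 0.25]]       # substochastic rows, P_i 1 < 1

rng = random.Random(3)
est = [wasow_estimate(H, b, P, i, 40000, rng) for i in range(2)]

# Exact solution of (I - H) x = b by Cramer's rule, for comparison.
det = (1 - H[0][0]) * (1 - H[1][1]) - H[0][1] * H[1][0]
exact = [((1 - H[1][1]) * b[0] + H[0][1] * b[1]) / det,
         ((1 - H[0][0]) * b[1] + H[1][0] * b[0]) / det]
```

The estimator has finite variance here because the weights are small relative to the transition probabilities; a poor choice of `P` can make the variance large or infinite even when the expected value exists.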

The Reverse or Adjoint Monte Carlo methods of von Neumann and Ulam (NU) and
of Wasow (WS) are based on probability transitions, $Q_i^j$, and multipliers or weights, $W_i^j$,
satisfying the following conditions:

$$ W_i^j = ( H_j^i / Q_i^j ) \, I(Q_i^j > 0) \ \mid \ H_j^i \neq 0 \Rightarrow Q_i^j > 0 \ \wedge \ Q_i \mathbf{1} < 1 . $$

We also define the extended Stochastic matrix,

$$ \overline{Q} = \begin{bmatrix}
   Q_1^1 & \ldots & Q_1^n & Q_1^{n+1} \\
   \vdots & \ddots & \vdots & \vdots \\
   Q_n^1 & \ldots & Q_n^n & Q_n^{n+1} \\
   0 & \ldots & 0 & 1 \end{bmatrix} . $$

We want to consider a random path or trajectory, $T$, of a "particle" starting at state
$i$, chosen at random with probability $r_i$, until the particle is absorbed at step $m + 1$, just
after visiting state $T(m)$ at step $m$, that is,

$$ T = [ T(1) = i, T(2), \ldots , T(m), T(m+1) = n+1 ] . $$

We define random variables, $X_j(T)$, associated to each trajectory. First we define the
multiplier products

$$ w_1 = b_i / r_i \quad \mbox{and} \quad w_k = w_{k-1} W_{T(k-1)}^{T(k)} , \quad 2 \leq k \leq m . $$

Von Neumann-Ulam's and Wasow's versions of the reverse or adjoint Monte Carlo
algorithm use $X_j(T)$ equal to, respectively,

$$ NU_j(T) = w_m \delta_{T(m)}^j / Q_{T(m)}^{n+1} \quad \mbox{and} \quad WS_j(T) = \sum_{k=1}^m w_k \delta_{T(k)}^j . $$

Again, the key to these Monte Carlo algorithms is that the expected value of the variable
$X_j(T)$, over all trajectories ending at state $j$, is the solution of the simultaneous equation,
provided these expected values are well defined. The proof for the reverse method is
similar to the direct case.
Appendix H



Stochastic Evolution and 
Optimization 

"God does not play dice (with the universe)." 

Albert Einstein (1879 - 1955). 

"Einstein, stop telling God what to do (with his dice)." 

Niels Bohr (1885 - 1962). 

"God not only plays dice, He also sometimes 
throws the dice where they are not seen. " 

Stephen Hawking (1942 - ). 

This section gives a condensed introduction to inhomogeneous Markov chains, the
theory that is needed to formalize Simulated Annealing (SA) and related algorithms
presented in chapter 5. We follow the presentations in Jetschke (1989) and Pflug (1996,
ch. 2), and assume some familiarity with homogeneous Markov chains, as presented in
Feller (1957, ch. 15) or Haggstrom (2002).

H.l Inhomogeneous Markov Chains 

We begin by introducing some notation for this chapter. First, a notational idiosyncrasy:
In almost all areas of mathematics it is usual to write a $d$-dimensional vector as a $d \times 1$
column matrix, $x$, and a linear transformation as the left multiplication of $x$ by a $d \times d$
square matrix $A$, that is, $Ax$. However, in the literature of Markov chains, it is usual to
write a $d$-dimensional vector as a $1 \times d$ or row matrix, $v$, and a linear transformation
as the right multiplication of $v$ by a $d \times d$ square matrix $P$, that is, $vP$. Herein, we make
use of the two forms, according to the context.

$d$-Dimensional vectors are written in lower case format, $v$. A density or probability
vector is a vector in the simplex support, $v \geq 0$ and $\|v\| = 1$. $d$-Dimensional (square)
matrices, on the other hand, are written in upper case format, $P$. In particular, $I$ is
reserved to denote the $d$-dimensional identity matrix. A $d$-dimensional kernel or transition
probability matrix has its rows in the simplex support. Right subscripts and superscripts
will index matrix rows and columns. For instance, $P_i$, $P^j$ and $P_i^j$ will indicate the $i$-th
row, the $j$-th column, and the element or entry $i,j$ of matrix $P$, respectively. In the same
way, $x_i$ and $v^j$ denote, respectively, the $i$-th element of the column vector $x$, and the $j$-th
element of the row vector $v$.

Braces are used to index a sequence of objects, such as $P\{1\}, P\{2\}, \ldots P\{t\}$. The
symbol $P\{s :: t\}$ will denote the product of the objects indexed from $s$ to $t$, that is,

$$ P\{s :: t\} = \prod_{k=s}^t P\{k\} . $$

Finally, given scalars, $\alpha$ and $\beta$, we have, as usual, $\alpha \wedge \beta = \min(\alpha, \beta)$,
$\alpha \vee \beta = \max(\alpha, \beta)$, $\alpha^+ = 0 \vee \alpha$, $\alpha^- = 0 \vee -\alpha$.



Homogeneous Markov Chains 

In a Markov chain with kernel or transition matrix $P$, $P_i^j \geq 0$ such that $P_i \mathbf{1} = 1$, $P_i^j$
represents the transition probability from state $x\{i\}$ to state $x\{j\}$ in a finite state space,
$S = \{ x\{1\}, x\{2\}, \ldots x\{d\} \}$. For the sake of simplicity, we often write the index, $i$, instead
of the indexed state, $x\{i\}$, that is, we identify the state space with its index set,
$S = \{1, 2, \ldots d\}$.

A trajectory or path of length $t$ from an initial state $i$ to a final state $j$ is given by
$\tau = [ \tau(1) = i, \tau(2), \ldots \tau(t), \tau(t+1) = j ]$. If a Markov chain is initially at state $i$, the
probability that it will follow the trajectory $\tau$ is

$$ Pr(\tau) = \prod_{k=1}^t P_{\tau(k)}^{\tau(k+1)} . $$

If we select the initial state, $i$, from distribution $v$, $v \geq 0$, $v \mathbf{1} = 1$, the probability
that the chain is at state $j$ after $t$ transitions, following any possible trajectories through
intermediate states, is given by $w^j$, where

$$ w = v \prod_{k=1}^t P . $$

A trajectory $\tau$ is possible if it has non-zero probability. A Markov chain is irreducible
if there is a possible trajectory connecting any initial state, $i$, to any final state, $j$. A
cycle is a trajectory with the same initial and final states. State $i$ has period $k > 1$ if any
possible cycle starting at $i$ has length multiple of $k$. Otherwise, state $i$ is aperiodic. A
Markov chain is aperiodic if it has no periodic states.

The probability distribution $g$ is invariant by kernel $P$ if $g = gP$. An invariant
distribution is also known as eigen-solution, equilibrium or stable distribution for $P$. It
can be shown that an irreducible and aperiodic Markov chain has a unique invariant
distribution, see Feller (1957). Under the same regularity conditions, it can also be shown
that the invariant distribution is the chain's limiting distribution, that is,

$$ \lim_{t \rightarrow \infty} \prod_{k=1}^t P =
   \begin{bmatrix} g \\ \vdots \\ g \end{bmatrix} =
   \begin{bmatrix} g^1 & g^2 & \ldots & g^d \\ \vdots & \vdots & & \vdots \\ g^1 & g^2 & \ldots & g^d \end{bmatrix} . $$

Hence, for any initial distribution, $v$,

$$ v \Big( \lim_{t \rightarrow \infty} \prod_{k=1}^t P \Big) =
   v \begin{bmatrix} g \\ \vdots \\ g \end{bmatrix} = g . $$
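The convergence of the powers of $P$ to a matrix with all rows equal to $g$ can be checked numerically. The three-state kernel below is a hypothetical example, chosen to be irreducible and aperiodic with invariant distribution $g = [0.25, 0.5, 0.25]$.

```python
def mat_mul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def kernel_power(P, t):
    """Product of t copies of the kernel P, starting from the identity."""
    d = len(P)
    R = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(t):
        R = mat_mul(R, P)
    return R


P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
g = [0.25, 0.5, 0.25]

gP = mat_mul([g], P)[0]                      # invariance: g = gP
P50 = kernel_power(P, 50)
row_gap = max(abs(P50[i][j] - g[j]) for i in range(3) for j in range(3))

v = [1.0, 0.0, 0.0]                          # any initial distribution
limit = mat_mul([v], P50)[0]
```

After 50 steps every row of the power is numerically indistinguishable from $g$, so the chain has forgotten its initial distribution.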
Given the irreducible and aperiodic kernel $P$, having the stable distribution $g$, the
reverse kernel, $R$, is defined as $R_i^j = g^j P_j^i / g^i$. The reverse kernel can be interpreted, using
Bayes theorem, as the kernel of the Markov chain $P$ going backwards in time, that is,

$$ R_i^j = Pr( x\{t\} = j \mid x\{t+1\} = i ) =
   \frac{ Pr( x\{t+1\} = i \mid x\{t\} = j ) Pr( x\{t\} = j ) }{ Pr( x\{t+1\} = i ) } =
   \frac{ P_j^i g^j }{ g^i } . $$

Kernel $P$ is reversible if there is a distribution $g$ satisfying the detailed balance equation,
$g^i P_i^j = g^j P_j^i$. Summing both sides of the detailed balance equation over index $i$, we
obtain $g^j = \sum_i g^i P_i^j$, showing that this is a sufficient condition for $g$ to be an invariant
distribution. Hence, for a reversible chain, the forward and backward kernels are identical,
$R_i^j = P_i^j$.

Vector and Matrix Norms 



A norm, in a vector space $E$, is a function

$$ \| \cdot \| : E \rightarrow R \ \mid \ \forall x, y \in E \ \mbox{and} \ \alpha \in R , $$

1. $\|x\| \geq 0$, and $\|x\| = 0 \Leftrightarrow x = 0$.

2. $\|\alpha x\| = |\alpha| \|x\|$.

3. $\|x + y\| \leq \|x\| + \|y\|$, the triangular inequality.

In particular, for $x \in R^n$ and $p \geq 1$,

$$ \|x\|_p = \Big( \sum_{i=1}^n |x_i|^p \Big)^{1/p} , \quad \|x\|_\infty = \max_{i=1}^n |x_i| $$

defines the standard $L_p$ norms in $R^n$.

Given a normed vector space, $(E, \| \ \|)$,

$$ \|T\| = \max_{x \neq 0} ( \|T(x)\| / \|x\| ) $$

defines the induced norm on the vector space of linear transformations, $T : E \rightarrow E$, for
which $\exists \alpha \in R \mid \|T(x)\| \leq \alpha \|x\|$, $\forall x \in E$, that is, the vector space of bounded linear
transformations on $E$. By linearity,

$$ \|T\| = \max_{x \mid \|x\| = 1} \|T(x)\| . $$

In $(R^n, \| \ \|)$ the induced norm on the set of bounded linear transformations, $T : R^n \rightarrow R^n$,
defines the matrix norm in $(R^n, \| \ \|)$. Specifically, for an $n \times n$ matrix $A$, $\|A\| = \|T\|$,
where $T(x) = Ax$.

Lemma 1: The matrix norm in $(R^n, \| \ \|)$ has the following properties: If $A$ and $B$
are $n \times n$ matrices,

1. $\|A\| \geq 0$ and $\|A\| = 0 \Leftrightarrow A = 0$

2. $\|A + B\| \leq \|A\| + \|B\|$

3. $\|AB\| \leq \|A\| \|B\|$

Lemma 2: ($L_1$ and $L_\infty$ explicit expressions).

$$ \|A\|_1 = \max_{j=1}^n \sum_{i=1}^n |A_i^j| , \quad
   \|A\|_\infty = \max_{i=1}^n \sum_{j=1}^n |A_i^j| . $$

Proof: To check the expression for $L_1$ and $L_\infty$ observe that

$$ \|Ax\|_1 = \sum_{i=1}^n \Big| \sum_{j=1}^n A_i^j x_j \Big| \leq \sum_{i=1}^n \sum_{j=1}^n |A_i^j| |x_j|
   \leq \sum_{j=1}^n |x_j| \max_{j=1}^n \sum_{i=1}^n |A_i^j| = \|A\|_1 \|x\|_1 , $$

$$ \|Ax\|_\infty = \max_{i=1}^n \Big| \sum_{j=1}^n A_i^j x_j \Big| \leq \max_{i=1}^n \sum_{j=1}^n |A_i^j| |x_j|
   \leq \max_{j=1}^n |x_j| \max_{i=1}^n \sum_{j=1}^n |A_i^j| = \|x\|_\infty \|A\|_\infty , $$

and that, if $k$ is the index that realizes the maximum in the norm definition, then the
equality is realized by the vector $x = I^k$ for $L_1$, and by the vector $x \mid x_j = \mbox{sign}(A_k^j)$ for
$L_\infty$.

One can check that $\|x\|_\infty \leq \|x\|_1 \leq n \|x\|_\infty$ and $\|x\|_\infty \leq \|x\|_2 \leq n^{1/2} \|x\|_\infty$. In fact,
any given $p$ norm can provide a bound to another $q$ norm and, in this sense, they are
equivalent. In the remainder of this section the $L_1$ norm will be used throughout, so we
will write $\|x\|$ for $\|x\|_1$.

Dobroushin's Contraction Coefficient 

Lemma 3 (Total Variation). Given two probability (non-negative, unitary, row) vectors,
$v$ and $w$, their Total Variation or $L_1$ difference has the alternative expressions:

$$ \|v - w\| = 2 \Big( 1 - \sum_k v^k \wedge w^k \Big) = 2 \sum_k ( v^k - w^k )^+ . $$

Proof: Just notice that

$$ 2 - 2 \sum_k v^k \wedge w^k = \sum_k v^k + \sum_k w^k - 2 \sum_k v^k \wedge w^k = \sum_k | v^k - w^k | , \quad \mbox{and} $$

$$ ( v^k - w^k )^+ = v^k - v^k \wedge w^k , \quad \mbox{hence} \quad \sum_k ( v^k - w^k )^+ = 1 - \sum_k v^k \wedge w^k . $$

The Dobroushin Contraction Coefficient or Ergodicity Coefficient of a transition
probability matrix, $P$, is defined as

$$ \rho(P) = \frac{1}{2} \max_{i,j} \sum_k | P_i^k - P_j^k | = \frac{1}{2} \max_{i,j} \| I_i P - I_j P \| . $$

It is clear from the definition that $\rho(P)$ measures the maximum $L_1$ distance between the
rows of $P$. If a sequence of kernels, $P\{k\}$, is clear from the context, we shall also write

$$ \rho\{k\} = \rho( P\{k\} ) , \quad \mbox{and} \quad \rho\{m :: n\} = \prod_{k=m}^n \rho\{k\} . $$



Lemma 4 (Vector Contraction). Two probability vectors, $v$ and $w$, are contracted by
the transition matrix $P$ in the sense that:

$$ \| vP - wP \| \leq \rho(P) \| v - w \| . $$

Proof: If $v = w$ or if $v = I_i$ and $w = I_j$, the result is trivial. Otherwise, let $v \neq w$
and $m = v \wedge w$. Defining

$$ G_i^j = \frac{ 2 ( v^i - m^i ) ( w^j - m^j ) }{ \| v - w \| } , $$

it is easy to check that:

(a) $G_i^j \geq 0$,

(b) $v - w = \sum_{i,j} G_i^j ( I_i - I_j )$, and

(c) $\sum_{i,j} G_i^j = \frac{1}{2} \| v - w \|$.

Hence,

$$ \| vP - wP \| = \Big\| \sum_{i,j} G_i^j ( I_i - I_j ) P \Big\| \leq
   \Big( \sum_{i,j} G_i^j \Big) \max_{i,j} \| ( I_i - I_j ) P \| =
   \frac{1}{2} \| v - w \| \, 2 \rho(P) = \rho(P) \| v - w \| . $$

Lemma 5 (Matrix Contraction). Two transition matrices, $P$ and $Q$, are contracted
in the sense that:

$$ \rho(PQ) \leq \rho(P) \rho(Q) . $$

Proof:

$$ \rho(PQ) = \frac{1}{2} \max_{i,j} \| ( I_i - I_j ) PQ \| \leq
   \rho(Q) \frac{1}{2} \max_{i,j} \| ( I_i - I_j ) P \| = \rho(P) \rho(Q) . $$
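The contraction coefficient and the two contraction lemmas can be checked on small examples. The sketch below uses two hypothetical three-state kernels and two hypothetical probability vectors; it verifies the inequalities numerically and does not replace the proofs.

```python
def dobrushin(P):
    """Half the maximum L1 distance between any two rows of the kernel P."""
    d = len(P)
    return 0.5 * max(sum(abs(P[i][k] - P[j][k]) for k in range(d))
                     for i in range(d) for j in range(d))


def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def l1(v):
    return sum(abs(a) for a in v)


P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
Q = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]

rho_P, rho_Q = dobrushin(P), dobrushin(Q)
rho_PQ = dobrushin(mat_mul(P, Q))            # matrix contraction

v = [1.0, 0.0, 0.0]
w = [0.2, 0.3, 0.5]
vP, wP = mat_mul([v], P)[0], mat_mul([w], P)[0]
lhs = l1([a - b for a, b in zip(vP, wP)])    # ||vP - wP||
rhs = rho_P * l1([a - b for a, b in zip(v, w)])
```

For these kernels the matrix contraction bound is attained with equality, which shows that the inequality cannot be improved in general.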

Theorem 6 (Weak Ergodicity (loss of memory)).

$$ \lim_{t \rightarrow \infty} \rho\{1 :: t\} = 0 \ \Rightarrow \ \lim_{t \rightarrow \infty} \| ( v - w ) P\{1 :: t\} \| = 0 . $$

Proof: Immediate, from Lemmas 4 and 5.

Lemma 7 (Strong Ergodicity). Assume that the following conditions hold:

(a) Each $P\{k\}$ has a unique invariant distribution, $v\{k\} = v\{k\} P\{k\}$, such that
$\sum_{k=1}^\infty \| v\{k+1\} - v\{k\} \| < \infty$ ;

(b) $\rho\{k\} > 0$ ;

(c) $\rho\{1 :: \infty\} = 0$ .

Then, there is a limiting distribution, $v\{\infty\}$, such that, for any distribution $w$,

$$ \lim_{t \rightarrow \infty} \| w P\{1 :: t\} - v\{\infty\} \| = 0 . $$

Proof. Condition 7a ensures that, with respect to the $L_1$ norm, $v\{k\}$ is a Cauchy
sequence in the compact simplex support. Hence, the sequence has a unique accumulation
point, $v\{\infty\} = \lim_{k \rightarrow \infty} v\{k\}$.

Since for $1 < s < t < \infty$

$$ v\{\infty\} P\{s :: t\} - v\{\infty\} = ( v\{\infty\} - v\{s\} ) P\{s :: t\} + v\{s\} P\{s :: t\} - v\{\infty\} = $$

$$ ( v\{\infty\} - v\{s\} ) P\{s :: t\} + \sum_{k=s}^{t-1} ( v\{k\} - v\{k+1\} ) P\{k+1 :: t\} + v\{t\} - v\{\infty\} , $$

we have the inequality

$$ \| w P\{1 :: t\} - v\{\infty\} \| \leq \| ( w P\{1 :: s-1\} - v\{\infty\} ) P\{s :: t\} \| + \| v\{\infty\} P\{s :: t\} - v\{\infty\} \| \leq $$

$$ 2 \rho\{s :: t\} + \| v\{\infty\} - v\{s\} \| + \sum_{k=s}^{t-1} \| v\{k\} - v\{k+1\} \| + \| v\{\infty\} - v\{t\} \| \leq $$

$$ 2 \rho\{s :: t\} + 2 \sup_{k \geq s} \| v\{\infty\} - v\{k\} \| + \sum_{k=s}^{\infty} \| v\{k\} - v\{k+1\} \| . $$

Letting $t \rightarrow \infty$, all terms in the right hand side can be made arbitrarily small for an
appropriate large value of $s$. Consequently, the left hand side converges to zero, Q.E.D.

Theorem 8 (Small Perturbations). It is possible to use a perturbed sequence of
kernels, $Q\{k\}$, instead of $P\{k\}$, and still obtain convergence to the same invariant
distribution provided that

$$ \sum_{k=1}^\infty \| P\{k\} - Q\{k\} \| < \infty . $$

Proof. The result follows from the inequality

$$ \| P\{s :: t\} - Q\{s :: t\} \| \leq \sum_{k=s}^t \| P\{k\} - Q\{k\} \| . $$

The Small Perturbations theorem plays an important role in the design of efficient 
algorithms based on heuristic perturbations, a technique that can greatly expedite the 
annealing process, see Stern (1991) and Pflug (1996, ch.2). 

H.2 Simulated Annealing 

The Metropolis Algorithm

Consider a system, $X$, where the system state is parameterized by a $d$-dimensional
coordinate vector $x = [x_1, \ldots x_d] \in X$. The neighborhood $N(x)$ is defined as the
set of states $y$ that are adjacent to $x$, that is, the set of states that can be reached
directly from $x$, that is, with one move, or in a single step. The neighborhood size is
$n(x) = |N(x)| \leq n = \max_x n(x)$. We assume that the neighborhood structure is symmetric,
that is, $y \in N(x) \Rightarrow x \in N(y)$, and that any two states, $x$ and $y$, are linked by a path
with at most $m$ steps. Our aim is to minimize a finite and positive objective function,
$H(x)$, with a unique global minimum attained at $x^*$. The system's Lipschitz constant,
$\Delta$, is the maximum difference in the value of $H$, for adjacent states, that is,

$$ \Delta = \max_x \max_{y \in N(x)} | H(y) - H(x) | . $$

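The Gibbs distribution is defined as

$$ g(\theta)_x = \frac{n(x)}{Z(\theta)} \exp( -\theta H(x) ) , \quad \mbox{with} \quad
   Z(\theta) = \sum_x n(x) \exp( -\theta H(x) ) . $$

The Gibbs distribution specifies state probabilities in many systems of Statistical Physics,
where the Hamiltonian function, $H$, represents state energies, and the parameter $\theta$ is the
system's inverse temperature. The normalization constant, $Z(\theta)$, is called the partition
function.

The Metropolis kernel is defined by

$$ P(\theta)_x^y = \left\{ \begin{array}{ll}
   \frac{1}{n(x)} \exp( -\theta ( H(y) - H(x) )^+ ) , & \mbox{if} \ y \in N(x) \\
   0 , & \mbox{otherwise} \end{array} \right. $$

Theorem 9 (Metropolis sampling). The Gibbs distribution $g(\theta)$ is invariant for the
Metropolis kernel $P(\theta)$.

Proof. It suffices to prove the detailed balance equation

$$ g(\theta)_x P(\theta)_x^y = g(\theta)_y P(\theta)_y^x . $$

If $y \notin N(x)$, balance is trivial. Otherwise, we use

$$ \frac{1}{n(x)} \exp( -\theta ( H(y) - H(x) )^+ ) =
   \frac{1}{n(x)} \Big( \frac{ g(\theta)_y n(x) }{ g(\theta)_x n(y) } \wedge 1 \Big) . $$

Assuming that $( g(\theta)_y n(x) ) / ( g(\theta)_x n(y) ) \geq 1$,

$$ g(\theta)_x P(\theta)_x^y = g(\theta)_x \frac{1}{n(x)} =
   g(\theta)_y \frac{1}{n(y)} \frac{ g(\theta)_x n(y) }{ g(\theta)_y n(x) } = g(\theta)_y P(\theta)_y^x . $$

The case $( g(\theta)_y n(x) ) / ( g(\theta)_x n(y) ) < 1$ follows similarly.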
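Detailed balance for the Metropolis kernel can be verified numerically on a small system. The sketch below assumes a hypothetical four-state path graph with arbitrary energies, takes the Gibbs weights proportional to $n(x) \exp(-\theta H(x))$, and assigns the probability of rejected moves to the diagonal of the kernel (an assumption of this sketch).

```python
import math


def metropolis_kernel(H, neighbors, theta):
    """Uniform proposal over neighbors, Metropolis acceptance exp(-theta * dH+)."""
    d = len(H)
    P = [[0.0] * d for _ in range(d)]
    for x in range(d):
        for y in neighbors[x]:
            uphill = max(H[y] - H[x], 0.0)
            P[x][y] = math.exp(-theta * uphill) / len(neighbors[x])
        P[x][x] = 1.0 - sum(P[x])          # assumption: rejected moves stay at x
    return P


def gibbs(H, neighbors, theta):
    """Weights n(x) exp(-theta H(x)), normalized by the partition function."""
    w = [len(neighbors[x]) * math.exp(-theta * H[x]) for x in range(len(H))]
    z = sum(w)
    return [wx / z for wx in w]


H = [3.0, 1.0, 2.0, 0.5]                    # hypothetical energies
neighbors = [[1], [0, 2], [1, 3], [2]]      # symmetric path graph
theta = 0.7

P = metropolis_kernel(H, neighbors, theta)
g = gibbs(H, neighbors, theta)
balance_gap = max(abs(g[x] * P[x][y] - g[y] * P[y][x])
                  for x in range(4) for y in range(4))
gP = [sum(g[x] * P[x][y] for x in range(4)) for y in range(4)]
```

The neighborhood sizes differ across states here, which is why the factor $n(x)$ appears in the Gibbs weights: without it, detailed balance would fail at the end points of the path.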

We will now study an appropriate cooling schedule $\theta_1, \theta_2, \ldots$, for the Simplified Metropolis
Algorithm where, at each temperature $1/\theta_t$, we take $m$ steps using the kernel $P\{t\} = P(\theta_t)$,
or a single step using the kernel $Q\{t\} = P\{t\}^m$.

Theorem 10 (Logarithmic Cooling). In the simplified Metropolis algorithm, for any 
monotone decreasing cooling schedule 

1 Am^ ln(n) 
9't - \n{t) 

and any initial distribution $w$,

$$ \lim_{t \rightarrow \infty} \| w Q\{1\} Q\{2\} \ldots Q\{t\} - v\{\infty\} \| = 0 . $$

Proof. From the definition of the system's Lipschitz constant, and from the fact that
any two states of the system are connected by a path of length at most $m$, it follows that,
for any two states, $x$ and $y$,

$$ Q\{t\}_x^y \geq \Big( \frac{1}{n} \exp( -\Delta \theta_t ) \Big)^m = n^{-m} \exp( -m \Delta \theta_t ) . $$
Hence,



max 



p{t} 



(1 



piQ{t}) = max 1 - J] ( Q{t}l A Q{t}l ) < 

\ z / 

- ( Q{tr: A Q{tr; ) ) < l - i- exp (-mA^,) = 



1 - 



n- 



1 



m 




Condition (c) of the strong ergodicity lemma follows from 

OO OO . X 

J]- = oo ^ p{l::oo} = n(l-^j=0. 

Finally, in order to check condition (a) of the strong ergodicity lemma, we must show
that the invariant measures $v\{t\} = v(\theta_t)$ constitute a Cauchy sequence. However, as
$\theta \rightarrow \infty$, the elements of $v\{t\}$ are either increasing for $x = x^*$ or decreasing for $x \neq x^*$ and
sufficiently large $\theta$. Hence, for $t > 1$,



There is an implicit choice of scale in the unit taken to measure the Hamiltonian or
objective function, $H(x)$. An adequate scale should start the annealing process with a
good acceptance rate for hill climbing moves. The step size of the logarithmic cooling
schedule is inversely proportional to the cooling constant, Am^ln(n). An alternative to 
the simplified Metropolis algorithm, taking $m$ steps at each temperature $\theta_t$, is to implement
the standard Metropolis algorithm using the cooling constant Am^ln(n). 

H.3 Genetic Programming 

The Intrinsic Parallelism Argument

Consider programs coded as binary (0-1) arrays of length $n$. A pattern or schema of
length $l$ is a partial specification of a binary array of length $l$.

The number of specified positions or loci, that is, $l$ minus the number of don't-cares,
defines the schema's order. The program's sub-array $p[j]$ in the window $k < j \leq k + l$ is
an instance of schema $s$ iff they coincide in the specified loci, that is, iff $p[k + i] = s[i]$,
for all $s[i] \neq *$.
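The definitions of schema order and schema instance translate directly into code. In this sketch programs and schemata are represented as strings over `0`, `1` and `*`, which is a representation chosen for the example; the function names are hypothetical.

```python
def schema_order(schema):
    """Number of specified loci: the length minus the number of don't-cares."""
    return sum(1 for s in schema if s != "*")


def is_instance(program, schema, k):
    """True if the window p[k+1 .. k+l] (1-based) matches the schema.

    With 0-based Python strings, that window is program[k : k + len(schema)].
    """
    window = program[k:k + len(schema)]
    if len(window) < len(schema):
        return False                      # window runs past the end of the program
    return all(s == "*" or s == p for s, p in zip(schema, window))


program = "0110100"
schema = "1*0"
order = schema_order(schema)
matches = [k for k in range(len(program)) if is_instance(program, schema, k)]
```

Here the schema `1*0` has order 2 and is instanced in the windows starting right after positions 1 and 4 of the program.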



Y7jv{t + 1} - .{Oil = EZ. E.,x (-0 + 1} - mt < 




$$ s[i] = 0, 1 \ \mbox{or} \ * \ \mbox{(don't-care)} , \quad 1 \leq i \leq l . $$
The intrinsic parallelism argument, presented in chapter 6, requires an estimate of
how many schemata of order $l$ and length $2l$ can be represented in a program of length $n$.

Following Reeves (1991), consider the window of length $2l$ at the beginning or leftmost
locus, $1 \leq j \leq 2l$, and let $B(2l, l)$ be the number of choices for the specified loci, $l$, among
the $2l$ available positions. This first window can obviously represent $B(2l, l) \, 2^l$ distinct
schemata, for once the $l$ loci have been chosen, there are $2^l$ possible 0-1 attributions to
their values.

Now slide the window $2l$ positions to the right, so as to span positions $2l + 1 \leq j \leq 4l$.
This new window has no positions in common with the previous one and can, therefore,
represent the same number of schemata. If we keep sliding the window $2l$ positions to the
right until positions $n - 2l < j \leq n$ are spanned, it follows from Stirling's approximation
that the total count of possible represented schemata satisfies the relation

$$ \frac{n}{2l} B(2l, l) \, 2^l \approx \frac{n}{2l} \frac{2^{2l}}{\sqrt{\pi l}} \, 2^l \propto m^3 , $$

where the population size is taken as $m = c \, 2^l$. The constant $c$ is interpreted as the
expected number of instances of any given schema (of order $l$ and length $2l$) present
in this population. Hence, under all the conditions above, we can (under) estimate the
number of schemata present in the population as proportional to $m^3$. For generalizations
of the implicit parallelism theorem, see Bertoni and Dorigo (1993).

Stirling's Approximation

For large $n$,

$$ \ln n! = \sum_{j=1}^n \ln j \approx \int_1^n \ln j \, dj = [ \, j \ln j - j \, ]_1^n = n \ln n - n + 1 . $$

A more detailed analysis of the remainder gives us

$$ \ln n! = n \ln n - n + O( \ln n ) . $$

From Stirling's approximation, the following Binomial approximations hold:

$$ \ln B(n, pn) \approx n H(p) , \quad \mbox{where} \quad H(p) = - p \ln p - (1 - p) \ln(1 - p) , $$

and

$$ B(2l, l) \approx \frac{2^{2l}}{\sqrt{\pi l}} . $$
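A quick numerical check of Stirling's approximation is sketched below. It assumes the standard refinement in which the remainder of $\ln n!$ over $n \ln n - n$ is close to $\frac{1}{2} \ln(2 \pi n)$, and the standard central binomial approximation $2^{2l} / \sqrt{\pi l}$; the values of $n$ and $l$ are arbitrary.

```python
import math


def log_factorial(n):
    """ln n!, computed through the log-gamma function."""
    return math.lgamma(n + 1)


n = 1000
crude = n * math.log(n) - n + 1                 # integral approximation
remainder = log_factorial(n) - (n * math.log(n) - n)
refined = 0.5 * math.log(2.0 * math.pi * n)     # assumed O(ln n) correction

l = 50
central = math.comb(2 * l, l)                   # B(2l, l), exact integer
central_approx = 4.0 ** l / math.sqrt(math.pi * l)
ratio = central / central_approx
```

The remainder grows like $\ln n$, as stated above, and the central binomial ratio is already within one percent of 1 for $l = 50$.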
H.4 Ontogenic Development

Autopoietic and alopoietic systems, living organisms and artificial machines, have both
to be built up and have their basic components maintained. However, there are profound 
differences in their development processes. In this section we examine the structural 
similarities and differences between such systems, and how such structures can explain 
some properties of systemic development. 

Herein, the adult or after construction systemic feature known as aging, receives special 
attention. Elementary or simple components have no structure, no internal states, and 
hence no memory. They can, therefore, exhibit no aging. Complex systems, however, 
exhibit some form of aging. We will see how the aging process of a complex system can
reflect systemic structure. We will contrast, in particular, bottom-up and top-down system 
construction, and their respective aging processes. Our analysis will follow Gavrilov (1981, 
2001, 2006). 

"The first fundamental feature of biosystems is that, in contrast to technical (artificial) 
devices which are constructed out of previously manufactured and tested components, or- 
ganisms form themselves in ontogenesis through a process of self-assembly out of de novo 
forming and externally untested elements (cells). The second property of organisms is the 
extraordinary degree of miniaturization of their components ( the microscopic dimensions 
of cells, as well as the molecular dimensions of information carriers like DNA and RNA ), 
permitting the creation of a huge redundancy in the number of elements. Thus, we expect 
that for living organisms, in distinction to many technical (manufactured) devices, the 
reliability of the system is achieved not by the high initial quality of all the elements but 
by their huge numbers (redundancy)." Gavrilov (2001, p. 531.) 



Aging Processes 

In this section we follow Gavrilov (1981, 2001, 2006) to analyse the aging process of some 
redundant series / parallel reliability systems. 

As usual in reliability theory, $t$ will denote failure time, $f(t)$ and $F(t)$ the density and
cumulative distribution functions of the failure time, $S(t) = 1 - F(t)$ the survival function
and

$$ h(t) = \frac{f(t)}{S(t)} = - \frac{dS(t)}{S(t) \, dt} = - \frac{d \ln S(t)}{dt} $$

the hazard function, failure rate, or mortality force, see Barlow and Proschan (1981).

Simple, memoryless or non-aging components are characterized by exponentially
distributed failure time. In this case, the failure time has constant hazard rate, $h(t) = \kappa$,
and $S(t) = \exp( -\kappa t )$, $\kappa, t > 0$. Complex systems are characterized by different aging
regimes which, in turn, reflect their structural characteristics. Two aging regimes are of
special interest to us:

1- The Weibull or power law regime, with $h(t) = \kappa t^\alpha$, $\kappa, \alpha > 0$, characteristic of
complex top-down, external assembly or alopoietic systems, and

2- The Gompertz-Makeham regime, with $h(t) = A + R \exp( \alpha t )$, $A, R, \alpha > 0$, characteristic
of complex bottom-up, self assembly or autopoietic systems. In biological models, the
Makeham parameter, $A$, indicates an external mortality force, whereas the pure Gompertz
regime, for $A = 0$, models the internal or systemic hazard function.

In what follows, we will see some structural models that explain these two regimes 
and test them on some engineering and biological systems. 

The two basic structures in reliability theory are parallel and series compositions.
Complex systems, in general, are recursive compositions of series and parallel blocks. A
parallel block fails if all its components fail, whereas a series block fails if any one of
its components fails; alternatively, a parallel block fails with its last failing component,
whereas a series block fails with its first failing component. Hence the series-parallel
reliability compositional rules:

- The cumulative distribution function of a parallel system with independent compo- 
nents equals the product of its components' cumulative distribution functions. 

- The hazard function of a series system with independent components equals the sum 
of its components' hazard functions. 

Let us now consider the "simplest complex system" modeling an organism or machine
with multiple, $m$, functions, where each function is performed by an independent block
of redundant simple components. That is, a system is assembled as a series of $m$ blocks,
$b_j$, $j = 1 \ldots m$, such that block $j$ is assembled as a parallel (sub) system with $n_j$ simple
components.

Top-down projects typically use a small number of redundant units, in order to op- 
timize production costs as well as to meet other project constraints such as maximum 
space or weight. Hence, components have to comply with strict standards, achieved by 
several forms of quality control tasks in the manufacturing process. In such systems all 
components are initially alive, operational or working, since they would have been otherwise
rejected by quality control. They are typically depicted in block diagrams such
as shown in Figure 1A. In this example each block has the same number, $n_j = i$, of
redundant components.

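Since each simple component has an exponential failure distribution, the reliability 
compositional rules lead to the following hazard function for each block, 

$$ h_j(t) = \frac{i \kappa e^{-\kappa t} \left(1 - e^{-\kappa t}\right)^{i-1}}{1 - \left(1 - e^{-\kappa t}\right)^{i}} , $$

and to the following systemic hazard function for the whole system, 

$$ h(t) = \sum_{j=1}^{m} h_j(t) = \frac{m i \kappa e^{-\kappa t} \left(1 - e^{-\kappa t}\right)^{i-1}}{1 - \left(1 - e^{-\kappa t}\right)^{i}} . $$

Using the early-life and late-life asymptotic approximations, $1 - \exp(-\kappa t) \approx \kappa t$ for 
$t \ll 1/\kappa$, and $1 - \exp(-\kappa t) \approx 1$ for $t \gg 1/\kappa$, the $i$-element parallel block and systemic 
hazard functions can be approximated as 

$$ h_j(t) \approx \begin{cases} i \kappa^{i} t^{i-1} & \text{if } t \ll 1/\kappa \\ \kappa & \text{if } t \gg 1/\kappa \end{cases} \quad \text{and} \quad h(t) \approx \begin{cases} m i \kappa^{i} t^{i-1} & \text{if } t \ll 1/\kappa \\ m \kappa & \text{if } t \gg 1/\kappa \end{cases} $$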
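
The early-life and late-life approximations of the block and systemic hazard functions can be compared with the exact expressions. The sketch below assumes identical exponential components with rate `kappa`, `i` components per block and `m` blocks in series; the function names and the parameter values are illustrative choices, not taken from the text.

```python
import math


def block_hazard(t, i, kappa):
    """Exact hazard of a parallel block of i exponential components:
    density of the block lifetime divided by its survival function."""
    p = -math.expm1(-kappa * t)  # 1 - exp(-kappa * t), computed accurately
    density = i * kappa * math.exp(-kappa * t) * p ** (i - 1)
    survival = 1.0 - p ** i  # loses precision when kappa * t is very large
    return density / survival


def system_hazard(t, m, i, kappa):
    """Series of m identical blocks: the hazards add up."""
    return m * block_hazard(t, i, kappa)


def early_approx(t, m, i, kappa):
    """Early-life approximation, valid for t << 1/kappa."""
    return m * i * kappa ** i * t ** (i - 1)


def late_approx(m, kappa):
    """Late-life approximation, valid for t >> 1/kappa."""
    return m * kappa


if __name__ == "__main__":
    m, i, kappa = 5, 3, 1.0  # illustrative values
    for t in (1e-3, 1e-2, 20.0):
        print(t, system_hazard(t, m, i, kappa),
              early_approx(t, m, i, kappa), late_approx(m, kappa))
```

The early-life column follows a power law in t (a Weibull-type regime), while the late-life hazard levels off at the constant m times kappa.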

Let us now consider self-assembled blocks where the number $i$ of initially working 
elements follows a Poisson distribution with parameter $\lambda = nq$, $P(i) = \exp(-\lambda) \lambda^{i} / i!$. We 
should also truncate the Poisson distribution, to account for the facts that the organism 
is initially alive, implying the exclusion of the $i = 0$ case, and that the organism is finite, 
implying a cut-off $\Pr(i > n) \approx 0$. The corrected normalization constant, $c$, for this truncated 
Poisson is given by $c^{-1} = 1 - \exp(-\lambda) - \exp(-\lambda) \sum_{i=n+1}^{\infty} \lambda^{i} / i!$. 

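As in the previous model, the systemic hazard function is the sum of the hazard functions 
of its blocks, where each block begins with $i$, Poisson distributed, working elements. Hence, 
the expected systemic hazard function can be written as: 

$$ h(t) = m \sum_{i=1}^{n} c \, \frac{e^{-\lambda} \lambda^{i}}{i!} \, h_i(t) , $$

where $h_i(t)$ is the hazard function of a parallel block with $i$ elements and $c$ is the 
normalization constant of the truncated Poisson distribution. 

Substitution of $h_i(t)$ yields the following systemic hazard rate and approximations: 

$$ h(t) = c m \kappa \lambda e^{-\lambda} e^{-\kappa t} \sum_{i=1}^{n} \frac{\lambda^{i-1} \left(1 - e^{-\kappa t}\right)^{i-1}}{(i-1)! \left[1 - \left(1 - e^{-\kappa t}\right)^{i}\right]} $$

$$ h(t) \approx \begin{cases} c m \kappa \lambda e^{-\lambda} \sum_{i=1}^{n} \frac{(\kappa \lambda t)^{i-1}}{(i-1)!} = R \left(e^{\alpha t} - \epsilon(t)\right) & \text{if } t \ll 1/\kappa \\ m \kappa & \text{if } t \gg 1/\kappa \end{cases} $$

In the last expression, $R = c m \kappa \lambda \exp(-\lambda)$, $\alpha = \kappa \lambda$, and $\epsilon(t) = \sum_{i=n+1}^{\infty} (\alpha t)^{i-1} / (i-1)!$. 
For fixed $\kappa$ and $\lambda$ and sufficiently small $t$, $\epsilon(t)$ is close to zero. Hence, in early life, 
$h(t) \approx R \exp(\alpha t)$, as in the pure Gompertz regime. 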
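
The early-life Gompertz behaviour of the self-assembled model can be illustrated numerically. The sketch below assumes blocks whose number of initially working elements follows a Poisson distribution truncated to 1..n, with exponential elements of rate `kappa`; all function names and parameter values are illustrative assumptions.

```python
import math


def truncated_poisson_weights(lam, n):
    """Poisson(lam) probabilities restricted to i = 1..n and renormalized.
    Returns the normalization constant c and the list of weights."""
    raw = [math.exp(-lam) * lam ** i / math.factorial(i) for i in range(1, n + 1)]
    c = 1.0 / sum(raw)
    return c, [c * w for w in raw]


def block_hazard(t, i, kappa):
    """Exact hazard of a parallel block of i exponential elements."""
    p = -math.expm1(-kappa * t)  # 1 - exp(-kappa * t)
    return i * kappa * math.exp(-kappa * t) * p ** (i - 1) / (1.0 - p ** i)


def expected_system_hazard(t, m, n, lam, kappa):
    """m blocks in series; each block hazard is averaged over the
    truncated Poisson distribution of initially working elements."""
    _, weights = truncated_poisson_weights(lam, n)
    return m * sum(w * block_hazard(t, i, kappa)
                   for i, w in enumerate(weights, start=1))


def gompertz_approx(t, m, n, lam, kappa):
    """Early-life approximation R * exp(alpha * t)."""
    c, _ = truncated_poisson_weights(lam, n)
    R = c * m * kappa * lam * math.exp(-lam)
    alpha = kappa * lam
    return R * math.exp(alpha * t)


if __name__ == "__main__":
    m, n, lam, kappa = 10, 50, 5.0, 0.01  # illustrative values, 1/kappa = 100
    for t in (1.0, 5.0, 10.0):
        print(t, expected_system_hazard(t, m, n, lam, kappa),
              gompertz_approx(t, m, n, lam, kappa))
```

For times well below 1/kappa the two columns should agree closely, and the agreement degrades as t approaches 1/kappa, where the late-life plateau takes over.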

Appendix I 
Research Projects 



In recent courses we have had classes of very diverse students. As expected, we had 
students coming from the courses of Applied Mathematics, Physics and, of course, Statistics, 
but we also had some students with quite different backgrounds, such as Computer 
Science, Economics, Law, Logic and Philosophy. This appendix proposes some research 
projects that may be especially interesting to some of these students. I do believe, of 
course, that most of them will also be interesting to the student of Statistics. If you are 
interested in one of these projects, send me an e-mail, or stop by my office, and let us 
talk about how to proceed. 

Bayesian and other Credal Networks 

The sparse factorization techniques described in Appendix F can be transposed to Bayesian 
Networks and other belief propagation networks as well. 

1- Symbolic phase: Implement the algorithms used to find a good elimination order, 
like the Gibbs heuristics, the Bayes-ball algorithm, and the other graph algorithms mentioned 
in Appendix F. A language such as C or C++, providing good support for dynamic 
data structures, is recommended. 

2- Numeric phase: Once the elimination order, requisite variables, etc. are determined, 
implement the numerical elimination process using static data structures. A language such 
as Fortran, providing good support for automatic parallelization, is recommended. 

3- Investigate the potential for parallelization of the sequential codes implemented in 
steps 1 and 2. Discuss the possibility, difficulties and advantages of developing tailor-made 
parallel code versus the use of automatic parallelization tools. 

4- Implement efficient MC or MCMC processes for computing the evidence supporting 
the existence of a given causal link, that is, the existence of a given arrow in (a) a given 
Bayes network topology; (b) all or a given subset of topologies. 
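
As a starting point for the symbolic phase of step 1, the sketch below computes an elimination order on an undirected graph. It uses the greedy minimum-degree heuristic, a common stand-in chosen here for brevity; it is not the Gibbs heuristic or the Bayes-ball algorithm named above, and the function name and graph encoding are illustrative assumptions. It is written in Python for compactness, although the text recommends C or C++ for this phase.

```python
def min_degree_order(adjacency):
    """Greedy minimum-degree elimination order for an undirected graph.

    adjacency: dict mapping each node to the set of its neighbours.
    At each step the node with the fewest remaining neighbours is
    eliminated and its neighbours are pairwise connected (fill-in).
    Returns the elimination order and the number of fill-in edges.
    """
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}  # work on a copy
    order = []
    fill_edges = 0
    while graph:
        # Ties are broken by node label to keep the result deterministic.
        v = min(graph, key=lambda u: (len(graph[u]), u))
        nbrs = sorted(graph.pop(v))
        for a in nbrs:
            graph[a].discard(v)
        # Connect every pair of neighbours that is not yet connected.
        for idx, a in enumerate(nbrs):
            for b in nbrs[idx + 1:]:
                if b not in graph[a]:
                    graph[a].add(b)
                    graph[b].add(a)
                    fill_edges += 1
        order.append(v)
    return order, fill_edges


if __name__ == "__main__":
    # A 4-cycle: eliminating any node creates exactly one fill-in edge.
    cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
    print(min_degree_order(cycle))
```

For a Bayesian network, the undirected graph would typically be the moral graph of the network; the number of fill-in edges gives a rough measure of how much sparsity the chosen order destroys.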

Mixture of Factor Analyzers 

Extend the theory and methods for Mixtures of Multivariate Gaussians, as described in 
Appendix B, to Mixtures of FA's. The geometric interpretation of these models is very 
similar, but whereas all Mixtures of Gaussians lie in the same d-dimensional space, each 
Mixture of FA's lies in a different hyperplane of the full d-dimensional space. In particular: 

a) Test the existence of a given component in the mixture. 

b) Test the existence of the least significant dimension of a given component. 

Polynomial Networks 

1- Discuss the use of edge annotations and heuristic merit functions in the synthesis of 
sub-networks, that is, the use of heuristic "recombinative guidance", in the terminology 
of Nikolaev and Iba (2001, 2006). 

2- Discuss the use of time dependent objective functions, as in Section 5.2, to guide 
the synthesis of the entire network. 

3- Discuss how to test (sub) topologies of a given network. 

(De)Coupling, (De)Composition, Complementarity 

1- Discuss the possibility of using complementary models in contexts other than Quantum 
Mechanics. Give examples of such applications. 

2- Discuss the possibility of extending the results of Borges and Stern (2007) to models 
with limited dependence using, for example, the formalism of Copulas. 

3- Investigate the meaning and interpretation of decoupling or separation schemes 
generated by alternative sparse and/or structured matrix factorizations. 

4- Using wavelet or other self-similar representations, it is possible to overcome the 
strict version of the Heisenberg uncertainty relation, see Vidakovic (1999, p.xxx). However, 
these representations may introduce non-local, delayed, integral, long-range, long-memory 
or other forms of coupling or dependence. Investigate how to obtain generalized 
Heisenberg-type relations for such cases. 

5- Give suitable interpretations and implement statistical models for the "necessary 
or consequential randomness" implied in the following examples: 

5a- Morgenstern and von Neumann (1947) and Nash (1951) proved the existence of 
equilibrium strategies for non-cooperative games. However, in general, these equilibria 
are not at deterministic or pure strategies, but at randomized or mixed strategies. 

5b- The concept of the Impossible (or Inconsistent, or Unholy) Trinity, also known as 
the Mundell-Fleming trilemma, is a hypothesis of international economics, stating the 
impossibility of simultaneously achieving the following goals: 1- fixed exchange rate; 2- free 
capital movement; and 3- independent monetary policy. 
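
The "necessary randomness" of item 5a can be seen in the smallest possible case. The sketch below computes the mixed-strategy equilibrium of a 2x2 zero-sum game without a saddle point, using the standard indifference conditions; the function name and the example payoffs (matching pennies) are illustrative assumptions, not taken from the references above.

```python
def mixed_equilibrium_2x2(a, b, c, d):
    """Mixed equilibrium of a zero-sum game with row-player payoffs
    [[a, b], [c, d]], assuming there is no pure-strategy saddle point.

    Returns (p, q, value), where p is the probability that the row player
    plays the first row, q the probability that the column player plays
    the first column, and value the expected payoff to the row player.
    """
    denom = a - b - c + d
    if denom == 0:
        raise ValueError("degenerate game: indifference equations have no unique solution")
    p = (d - c) / denom  # makes the column player indifferent
    q = (d - b) / denom  # makes the row player indifferent
    if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
        raise ValueError("the game has a pure-strategy saddle point")
    value = (a * d - b * c) / denom
    return p, q, value


if __name__ == "__main__":
    # Matching pennies: no pure equilibrium exists, so both players randomize.
    print(mixed_equilibrium_2x2(1, -1, -1, 1))
```

In matching pennies any deterministic choice can be exploited by the opponent, so the only equilibrium requires each player to randomize; a statistical model of such a game must therefore treat the observed moves as genuinely random.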

Economics 

The economic system may be characterized by eigen-solutions, equilibria or fixed points 
resulting from the collective interaction of many economic agents. Some of the most 
important of such eigen-values are prices, see for example Ingrao and Israel (1990). 

1- Give concrete examples of such situations, that are well suited for experimental 
research. 

2- Discuss how to measure the epistemic value of such an economic or financial 
eigen-value. 

3- Discuss how to assess the consistency of such eigen-values, for example, by means 
of sensitivity analyses. 

4- Discuss the need for regulatory mechanisms protecting such eigen-solutions such as, 
for example, anti-trust laws. 

5- Discuss the consequences of Zangwill's global convergence theorem for the design of 
good regulatory policies; see, for example, Border (1989), Ingrao and Israel (1990) and 
Zangwill (1964). 

Law 

The Objective / Subjective dichotomy manifests itself in the legal arena via the notion of 
responsibility. Responsibility may require either two or three conditions, namely: 

a) Damage: A loss suffered by the victim (or offended party). 

b) Causal relation: A causal nexus linking an action (or lack thereof) of the accused 
(or defendant, offending party, perpetrator) to the damage suffered by the victim. 

c) Illicitness: An explanation why the action (or lack thereof) of the accused was illegal 
or unlawful. 

While the programs and codes (in Luhmann's sense) needed for checking condition (c) 
are internal ones, that is, programs and codes within the legal system itself, the programs 
and codes needed for checking conditions (a) and (b) are often external, that is, programs 
and codes of other systems, such as science or economics, for example. 

Hence, it is not surprising that a responsibility entailed by conditions (a) and (b) alone 
is called "objective", while one requiring conditions (a), (b) and (c) is called "subjective", 
see Stern (2007). 

R.B. Stern (2007) suggests the following principle, hereby named "Transference of 
Objectivity" (TrOb), for systems characterized by the existence of eigen-solutions resulting 
from complex collective interactions: 

If an individual agent (or a small group of agents) in the system disrupts such an eigen- 
solution, hence destroying its objective character, then this agent becomes, in the same 
measure, objectively responsible for consequential damages caused by the disruption. 

1- Discuss the plausibility of the TrOb principle. 

2- Discuss possible justifications for the TrOb principle. 

3- Discuss the applicability of the TrOb principle in: 
a) Economic law; b) Environmental law. 

4- Discuss the applicability of TrOb for state actions. 

5- Discuss the applicability of TrOb for lost revenues. 

6- Discuss the applicability of TrOb for the loss of a chance. 

Experiment Design and Philosophy 

1) Discuss the possibility of reconciling the objective inference entailed by randomization 
methods with biased allocation and selection procedures. 

2) Discuss the possibility of using optimal selections or allocations obtained by Multi- 
Objective or Goal Programming, where some (fake or artificial) explaining variables have 
randomly generated values. 

3) Discuss the possibility of using low-discrepancy selections or allocations obtained 
by quasi-random or hybrid (scrambled quasi-random) lattices. 

4) How can we corroborate the objective character of such inference procedures? For 
example, what is the importance of sensitivity analyses in these allocations? 

5) What kind of protocols are appropriate for such inference procedures? 

6) What criteria can be used in balancing the epistemic value of a clinical study versus 
the well-being of the participants? What kind of moral, ethical and legal arguments can 
be used to support these criteria? 
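
For item 3, a low-discrepancy sequence can be generated in a few lines. The sketch below builds the van der Corput radical inverse and a two-dimensional Halton sequence, and uses the first coordinate for a simple two-arm allocation; the allocation rule and all function names are illustrative assumptions, and no scrambling is applied.

```python
def van_der_corput(index, base):
    """Radical inverse of a positive integer `index` in the given base:
    the base-`base` digits of index are mirrored around the radix point."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value


def halton(n_points, bases=(2, 3)):
    """First n_points of a Halton sequence, one coprime base per coordinate."""
    return [tuple(van_der_corput(k, b) for b in bases)
            for k in range(1, n_points + 1)]


def allocate_by_first_coordinate(n_units):
    """Hypothetical rule: unit k goes to the treatment arm (True) when the
    base-2 van der Corput value of k is below 0.5, otherwise to control."""
    return [van_der_corput(k, 2) < 0.5 for k in range(1, n_units + 1)]


if __name__ == "__main__":
    for point in halton(5):
        print(point)
    arms = allocate_by_first_coordinate(8)
    print("treated:", sum(arms), "of", len(arms))
```

Unlike pseudo-random draws, consecutive points of such a sequence fill the unit square evenly, which keeps the two arms closely balanced at every sample size; whether this preserves the objective character of randomized inference is exactly the question posed in items 3 and 4.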

Art 

Make your contribution to the Art Gallery. 



Appendix J 

Image and Art Gallery 



The images in this gallery are somehow related to topics discussed in the main text. 
They are provided with no fixed definite interpretation, and only meant as a stimulus to 
imagination and creativity. Paraphrasing an aphorism of the poet Fernando Pessoa, 
- There is no good science that is vague, nor good art or poetry that is not. 

Additional contributions to the art gallery, many made by students or interested readers, 
can be found at www.ime.usp.br/~jstern/books/gallery2.pdf. 




Figure JA.1: Wire Walking. 
The most important thing is not to fear at all. 

Figure JA.3: Albert Einstein on his Bicycle. 
Following the gentle curvature of the garden's geometry. 

Figure JA.4: Niels Bohr on his Bicycle. 
Complementary pedals must be pushed one at a time. 




Figure JA.5: Empirical Science: All at Once! 
Caution: Do this only at a fully equipped laboratory. 

Figure JA.6: Triadic (or Semiotic) Wire Walking. 
Etching by Alex Flemming (untitled, 1979, PA III/X), 
based on a photo of the Moscow Circus at Sao Paulo. 
Private collection of Marisa Bassi Stern. 



